Intro

Project overview

In this case study, I am a junior data analyst who is working for the marketing analytics team at Bellabeat, a high-tech company that designs health tracking products for women. This hypothetical scenario is provided by Google’s Data Analytics Certificate Program through Coursera, and I will be outlining the standard data analysis pathway throughout this project (ask, prepare, process, analyze, share, and act).

Ask

Business task

My main objective for this project is the following: Based on trends in non-Bellabeat smart devices for tracking health data, what trends occur among consumers and how can they be applied to a Bellabeat device and the company’s marketing strategy?

Stakeholders

After my analysis, I would share my high-level findings and provide suggestions to improve a Bellabeat device to the main stakeholder, Urška Sršen, one of the two founders of Bellabeat. I would also share my findings and more detailed analyses and code with the data analytics team prior to sharing high-level results with the founder.

Prepare

Non-Bellabeat Data

The data for non-Bellabeat devices was downloaded from Kaggle user Möbius: Fitbit Fitness Tracker Data. This data collection contains 18 CSV files about exercise activity, intensity, calories, steps, METs (metabolic equivalent of task), heart rate, weight, and sleep across different time measurements (days, hours, minutes, seconds).

The original datasets were created by Furberg, R. et al. and published on Zenodo in May 2016. They are openly licensed under a Creative Commons Attribution 4.0 International Public License. The data comes from a 2016 crowdsourced survey of Fitbit users, distributed via Amazon Mechanical Turk.

library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
  method         from
  print.tbl_lazy     
  print.tbl_sql      
── Attaching packages ──────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.5     ✓ dplyr   1.0.7
✓ tidyr   1.1.4     ✓ stringr 1.4.0
✓ readr   2.0.2     ✓ forcats 0.5.1
── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Load datasets

When loading data, I wanted to give the dataframes meaningful names. To simplify the file names, I used the following key for time measurements and table formats:

  • daily - d
  • hourly - h
  • minutes - m
  • seconds - s
  • narrow - Narr
  • wide - Wide

Each dataframe name begins with the time-measurement abbreviation, followed by a period for readability and a description of the health measurement (for example, h.steps for hourly steps).

# I commented out dataframes that won't be used in this analysis.
# To view dataframes, uncomment each View(dataframe_name) and re-run code

# activity
d.activity <- read_csv("./data/dailyActivity_merged.csv", show_col_types = FALSE) 
# View(d.activity)
# d.activity contains the data found separately in d.intensities, d.calories, d.steps

# intensities
# d.intensities <- read_csv("./data/dailyIntensities_merged.csv", show_col_types = FALSE)
# View(d.intensities)
h.intensities <- read_csv("./data/hourlyIntensities_merged.csv", show_col_types = FALSE)
#View(h.intensities)
# m.intensitiesNarr <- read_csv("./data/minuteIntensitiesNarrow_merged.csv", show_col_types = FALSE)
#View(m.intensitiesNarr)
# m.intensitiesWide <- read_csv("./data/minuteIntensitiesWide_merged.csv", show_col_types = FALSE)


# calories
# d.calories <- read_csv("./data/dailyCalories_merged.csv", show_col_types = FALSE)
# View(d.calories)
h.calories <- read_csv("./data/hourlyCalories_merged.csv", show_col_types = FALSE)
# View(h.calories)
# m.caloriesNarr <- read_csv("./data/minuteCaloriesNarrow_merged.csv", show_col_types = FALSE)
# m.caloriesWide <- read_csv("./data/minuteCaloriesWide_merged.csv", show_col_types = FALSE)
# View(m.caloriesWide)

# steps
# d.steps <- read_csv("./data/dailySteps_merged.csv", show_col_types = FALSE)
# View(d.steps)
h.steps <- read_csv("./data/hourlySteps_merged.csv", show_col_types = FALSE)
# View(h.steps)
# m.stepsNarr <- read_csv("./data/minuteStepsNarrow_merged.csv", show_col_types = FALSE)
# View(m.stepsNarr)
# m.stepsWide <- read_csv("./data/minuteStepsWide_merged.csv", show_col_types = FALSE)
# View(m.stepsWide)

# METs - metabolic equivalent of task (relates to intensity)
# minuteMETS <- read_csv("./data/minuteMETsNarrow_merged.csv", show_col_types = FALSE)
# View(minuteMETS)

# heartrate
# s.heartrate <- read_csv("./data/heartrate_seconds_merged.csv", show_col_types = FALSE)

# weight
# weightLog <- read_csv("./data/weightLogInfo_merged.csv", show_col_types = FALSE)
# View(weightLog)

# sleep
sleepDay <- read_csv("./data/sleepDay_merged.csv", show_col_types = FALSE)
# View(sleepDay)
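
The naming key can also be expressed as a small helper. This is only a sketch: short_name() is a hypothetical function that is not part of the original analysis, and it covers just the daily, hourly, and minute prefixes. Irregular file names (for example the heart rate and METs files) would still need to be named by hand.

```r
# Hypothetical helper: derive a short dataframe name from a "*_merged.csv" file name.
short_name <- function(file) {
  stem <- sub("_merged\\.csv$", "", basename(file))
  prefixes <- c(daily = "d", hourly = "h", minute = "m")
  for (p in names(prefixes)) {
    if (startsWith(stem, p)) {
      rest <- substring(stem, nchar(p) + 1)        # e.g. "StepsNarrow"
      rest <- sub("Narrow$", "Narr", rest)         # shorten the format suffix
      rest <- paste0(tolower(substring(rest, 1, 1)), substring(rest, 2))
      return(paste0(prefixes[[p]], ".", rest))
    }
  }
  stem  # no known prefix: keep the stem unchanged (e.g. "sleepDay")
}

short_name("./data/dailyActivity_merged.csv")
short_name("./data/minuteStepsNarrow_merged.csv")
short_name("./data/sleepDay_merged.csv")
```

The three calls return names that follow the key used in this project: d.activity, m.stepsNarr, and sleepDay.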

After loading the data in R, I noticed that some column values, like distance or METs, were vague in meaning. For instance, I wondered: what unit does the Fitbit tracker use for total distance? I found the metadata for the Fitabase files, the Data Dictionary, on the Fitabase website. This metadata explains what each column value means. It turns out that total distance is measured in kilometers.

Bellabeat Device – Time

Although I loaded many files, I will only analyze the health measurements that fall within the scope of this project – in particular, those that relate to the Bellabeat device of choice. After reviewing Bellabeat’s products, I chose to focus on Time, a health and wellness tracking watch for:

  • Active time
  • Sleep time
  • Mindfulness
  • Stress resistance
  • Menstrual cycle

Additional benefits of the watch include wireless connectivity and syncing to the Bellabeat app, meditation and hydration tracking, and a wellness score calculation, according to the website.

I wanted to examine trends for Fitbit data on activity and sleep because those health measures are relevant to the Time watch. Of the datasets, I incorporated the following in my analysis:

  • d.activity
  • h.intensities
  • h.calories
  • h.steps
  • sleepDay

These datasets are in wide format and saved as CSVs in the [data] folder within the overall project folder.

For inspiration on ways to analyze and visualize Fitbit tracking data, I referred to Yash Soni’s tutorial, published on freeCodeCamp. The main difference is that Soni’s data is specific to one individual, whereas the sample Fitbit data contains activity data for up to 33 different users, each assigned a unique Id. Based on the tutorial, I can analyze the data by day of the week, with boxplots or bar graphs, or with a correlation matrix.

In addition to tidyverse, I used the following R packages:


library(lubridate) # wrangling dates

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union
library(aweek)# adding week numbers to dates
library(janitor) # cleaning dataframe column names

Attaching package: ‘janitor’

The following objects are masked from ‘package:stats’:

    chisq.test, fisher.test
library(ggcorrplot) # correlation matrix
library(scales) # visualizing percentage scales

Attaching package: ‘scales’

The following object is masked from ‘package:purrr’:

    discard

The following object is masked from ‘package:readr’:

    col_factor

Daily Activity

Process – Cleaning daily activity data

1. Adding day of week and Id2 to d.activity

Since I planned to examine trends by day of the week, I used the weekdays() function in base R to convert each date into the day of the week (e.g., Monday). To make the Id numbers easier to read, I assigned each user Id a letter or a letter pair, stored as Id2.
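
As a side note, weekdays() returns day names in the current locale, so matching against English names can fail on a non-English system. A minimal, locale-independent sketch of the weekday/weekend split is shown below. It uses the "%u" weekday number from base R, and the three dates are illustrative only.

```r
# Illustrative dates only; not a sample of the dataset.
dates <- as.Date(c("2016-04-12", "2016-04-16", "2016-04-17"))

# "%u" gives the weekday number as text: "1" = Monday ... "7" = Sunday.
day_num <- format(dates, "%u")

# Saturday ("6") and Sunday ("7") count as weekend days.
day_type <- ifelse(day_num %in% c("6", "7"), "weekend", "weekday")

data.frame(date = dates, day_num = day_num, day_type = day_type)
```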

d.activity$date <- as.Date(d.activity$ActivityDate, format="%m/%d/%Y")
#View(d.activity)

# extract week day
d.activity$Day <- weekdays(d.activity$date)

li.wkdays <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
li.wkends <- c("Sunday", "Saturday")
li.daysofwk <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

d.activity <- d.activity %>% mutate(wkdayend = case_when(Day %in% li.wkdays ~ "weekday", Day %in% li.wkends ~ "weekend"))
d.activity$Id <- as.character(d.activity$Id)
#View(d.activity)

# Pair each Id with a unique letter Id2. LETTERS is "A":"Z". As there are 33 users and only 26 alphabet letters, add 7 more unique letter pair as Id2.
Id2 <- append(LETTERS, c("ZA", "ZB", "ZC", "ZD", "ZE", "ZF", "ZG")) # new Ids
Id.unique <- d.activity %>% select(Id) %>% distinct() # distinct Ids
Id.new <- data.frame(Id.unique, Id2)
#View(Id.new)

# left join to d.activity
d.activity <- merge(x=d.activity, y=Id.new, by="Id", all.x=TRUE)
#View(d.activity)
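
The Id-to-letter mapping above hard-codes seven extra labels. A small sketch of the same idea that generates labels for any number of users (up to 52) follows. make_labels() is a hypothetical helper, and the Ids in the toy vector are used only for illustration.

```r
# Hypothetical helper: "A".."Z", then "ZA".."ZZ", truncated to n labels.
make_labels <- function(n) {
  labs <- c(LETTERS, paste0("Z", LETTERS))
  stopifnot(n <= length(labs))
  labs[seq_len(n)]
}

# Toy Id column with a repeated user.
ids <- c("1503960366", "1624580081", "1503960366", "1644430081")

# One row per distinct Id, paired with a readable label.
lookup <- data.frame(Id = unique(ids), Id2 = make_labels(length(unique(ids))))

# Left join the labels back onto the full table.
merged <- merge(data.frame(Id = ids), lookup, by = "Id", all.x = TRUE)
merged
```

With 33 users, make_labels(33) yields the same labels as the manual approach: A through Z, then ZA through ZG.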

2. A visual overview of d.activity

I plotted a correlation matrix to visualize potential correlations between the numeric columns. I referenced Alboukadel Kassambara’s documentation of the ggcorrplot package, which is also on GitHub.

Sedentary minutes and distance have little to no correlation with calories, as expected. For each active intensity (very, moderate/fair, light), the minutes and distance have a positive correlation, as do the total distance and tracker distance. When analyzing the data, I will focus on the active exercise and filter out the sedentary minutes and distance.
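
To make the idea of a correlation matrix concrete, here is a minimal base R sketch on a made-up table. The column names and values are assumptions for illustration, not values from the Fitbit data.

```r
# Made-up daily values for five days; illustrative only.
toy <- data.frame(
  steps         = c(2000, 5000, 8000, 11000, 14000),
  calories      = c(1600, 1900, 2100, 2500, 2700),
  sedentary_min = c(900, 1100, 800, 1000, 950)
)

# Pairwise Pearson correlations, rounded to one decimal.
mtrx <- round(cor(toy), 1)
mtrx

# p-value for a single pair of columns.
p_steps_cal <- cor.test(toy$steps, toy$calories)$p.value
p_steps_cal
```

Each cell of the matrix is the correlation between the row and column variables, so the matrix is symmetric with ones on the diagonal.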

Next, I wanted to know how consistently users record their daily activity and use the Fitbit tracker. The number of entries varied between users, which can be seen in the scatter plot of date and TotalDistance for each user. I isolated the 30 users with the most data.

# corr matrix for d.activity using the ggcorrplot library
d.activity_sel <- d.activity %>% select(TotalDistance: TrackerDistance, VeryActiveDistance: Calories)

mtrx.activity <- round(cor(d.activity_sel), 1) # correlation matrix
pvals.activity <- cor_pmat(d.activity_sel) # p-values matrix

corr.activity <- ggcorrplot(mtrx.activity, outline.color = "gray28", type="lower", legend.title = "correlation") +
  ggtitle("Correlation of Daily Steps, Minutes, Distance, and Calories") +
  labs(subtitle = "Fitbit daily activity data for 33 users from April 12 to May 12, 2016", caption="Source: Furberg et al. 2016. Zenodo.") +
    theme(plot.title = element_text(face="bold"),
          plot.title.position= "plot",
          plot.caption.position = "plot")
corr.activity


# ggsave("./plots/corr.activity.png" , width = 8, height = 5, dpi=200)

 
# How consistent is the data for users?
userActivityCount <- d.activity %>% group_by(Id2) %>% 
  summarize(Count = n()) %>% arrange(desc(Count))
#View(userActivityCount)

# Plotting the date and TotalDistance for each user shows some users stopped data collection before May 12, 2016, the last day.
scatter.dateDistance <- d.activity %>% ggplot(aes(x = date, y=TotalDistance)) +
  geom_point(aes(color = Calories), stat="identity") +
  ggtitle("Some Fitbit users did not record distance for the full time period") +
  labs(subtitle = "Activity data was recorded Tuesday, April 12, 2016 to Thursday, May 12, 2016", caption="Source: Furberg et al. 2016. Zenodo.") +
  scale_x_date(date_breaks="2 weeks", date_labels="%b %d") +
  guides(color = guide_legend(reverse=TRUE)) +
  facet_wrap(~Id2, nrow=5) +
  theme_minimal() +
  theme(axis.text.x= element_text(hjust=0.5, vjust=0.3),
        axis.title.x = element_text(vjust = -0.5),
        plot.title = element_text(face="bold"), 
        plot.caption.position= "plot")
 
scatter.dateDistance

# ggsave("./plots/scatter.dateDistance.png" , width = 9, height = 5, dpi=300)


# Determine top 30 users with most daily data available
userActivity30 <- userActivityCount %>% top_n(30, Count) %>% glimpse()
Rows: 30
Columns: 2
$ Id2   <chr> "A", "B", "D", "E", "F", "G", "H", "J", "M", "O", "P", "Q…
$ Count <int> 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 3…
#View(userActivity30)
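
One caveat when picking the "top n" rows: top_n() keeps ties, so it can return more than n rows when several users share the cutoff count. The sketch below uses made-up counts and dplyr's slice_max() to show both behaviours. It is an optional alternative, not the method used above.

```r
library(dplyr)

# Made-up counts: two users tie at the cutoff.
counts <- tibble(Id2 = c("A", "B", "C", "D"), Count = c(31L, 31L, 30L, 30L))

# Default behaviour keeps ties, so asking for 3 rows returns 4 here.
top_with_ties <- counts %>% slice_max(Count, n = 3)

# with_ties = FALSE returns exactly 3 rows.
top_exact <- counts %>% slice_max(Count, n = 3, with_ties = FALSE)

nrow(top_with_ties)
nrow(top_exact)
```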

Share – Daily activity by day of the week

The violin plots show a broader distribution of steps and distance on weekends than on weekdays. Among the weekdays, Tuesdays have the greatest median steps and total distance.


top30.activity <- d.activity %>% filter(Id2 %in% userActivity30$Id2)
#View(top30.activity)

count_dates <- top30.activity %>% group_by(date) %>% count() %>% glimpse()
Rows: 31
Columns: 2
Groups: date [31]
$ date <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-15, 2016-04-1…
$ n    <int> 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30…
# isolate first two full weeks (starting from Sunday)
top30.activity.fullwks <- top30.activity %>% filter(between(date, as.Date("2016-04-17"), as.Date("2016-04-30")))

check1 <- top30.activity.fullwks %>% group_by(Id2) %>% count()
#View(check1)

#View(top30.activity.fullwks)
# steps by day
vio.stepsDay <- top30.activity.fullwks %>% ggplot(aes(x=Day, y=TotalSteps)) +
  geom_violin(aes(fill= wkdayend), alpha=0.5, width=0.8) +
  geom_boxplot(width =0.3, aes(fill=wkdayend), alpha=0.4) +
  ggtitle("Wider distribution of total steps on weekends than weekdays ", subtitle="Greater median steps on Tuesdays, of the weekdays") +
  scale_x_discrete(limits = li.daysofwk) +
  scale_y_continuous(breaks = seq(0, 28000, 7000)) +
  labs(x ="day", y = "steps", fill="day type", caption = "Fitbit data of 30 users for two weeks (Sun, April 17 to Sat, April 30, 2016) \nSource: Furberg et al. 2016. Zenodo.") +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title= element_text(face="bold"))
vio.stepsDay


# ggsave("./plots/vio.stepsDay.png" , width = 8, height = 5, dpi=300)


# distance by day
# use the two full weeks (top30.activity.fullwks) so the data matches the caption
vio.distanceDay <- top30.activity.fullwks %>% ggplot(aes(x=Day, y=TotalDistance)) +
  geom_violin(aes(fill= wkdayend), alpha=0.5, width=0.8) +
  geom_boxplot(width =0.3, aes(fill=wkdayend), alpha=0.4) +
  ggtitle("Wider distribution of total distance on weekends than weekdays", subtitle="Of the weekdays, Tuesdays have slightly greater median distance") +
  scale_x_discrete(limits = li.daysofwk) +
  scale_y_continuous(breaks = seq(0, 28, 4)) +
  labs(x = "day", y="total distance, km", fill="day type", caption = "Fitbit data of 30 users for two weeks (Sun, April 17 to Sat, April 30, 2016) \nSource: Furberg et al. 2016. Zenodo.") +
  theme_minimal() +
  theme(plot.caption.position="plot",
        plot.title= element_text(face="bold"))
vio.distanceDay


# ggsave("./plots/vio.distanceDay.png" , width = 8, height = 5, dpi=300)
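
Comparing days of the week is only fair when every weekday appears the same number of times in the window. A quick base R check of the April 17 to April 30 window, as a sketch, could look like this:

```r
# All dates in the two-week window used for the violin plots.
window <- seq(as.Date("2016-04-17"), as.Date("2016-04-30"), by = "day")

# Number of days in the window.
n_days <- length(window)

# "%u" weekday numbers: "1" = Monday ... "7" = Sunday.
day_counts <- table(format(window, "%u"))

n_days
day_counts
```

The window holds 14 days and each weekday occurs exactly twice, starting on a Sunday and ending on a Saturday.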

Share – Weekly Activity Target Times

In the bar graph, over three quarters of users met the weekly goal of at least 150 minutes of light-to-moderate exercise in all three weeks. Around 30% of users logged at least 75 minutes of very active exercise per week in all three weeks, while about 30% of users did not reach that goal in any week.

Based on this sample, most people meet the weekly exercise recommendations for light-to-moderate exercise, while very active exercise is more variable among users.

# visualize percentage results in bar graph, facet wrap by activity type
bar.wkActivity <- numUsers.wkActivity %>% 
  ggplot(aes(x = wk_count, y = pcts_users)) +
  geom_bar(aes(fill=activity), stat="identity") + 
  geom_text(aes(label = count_users), vjust=1.3) +
  ggtitle("In three weeks: most users met low-moderate activity goals, \n about a third met very active goals") +
  scale_y_continuous(labels=scales::label_percent(accuracy=1)) +
  scale_fill_manual(values = c("palegreen3", "sienna2"), labels= c("low or moderate (>=150 min)", "very (>=75 min)")) +
  labs(subtitle = "Fitbit minute intensity data for three weeks (Sun, April 17 to Sat, May 07, 2016)", x = "weeks", y = "users (32 total)", caption = "Labels on bars are number of users \nSource: Furberg et al. 2016. Zenodo.", fill = "weekly activity goals") +
  theme_minimal() +
  theme(plot.caption.position="plot",
        plot.title= element_text(face="bold")) +
  facet_wrap(~activity) +
  coord_cartesian(ylim = c(0, 1))
  
bar.wkActivity


# ggsave("./plots/bar.wkActivity.png" , width = 8, height = 5, dpi=300)
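
The code that builds numUsers.wkActivity is not shown in this write-up. As a rough sketch of how such weekly goal counts could be derived, the base R example below sums minutes per user and week and counts the weeks in which each goal was met. The data is made up, all object names are hypothetical, and the thresholds (150 and 75 minutes) are the ones stated above.

```r
# Made-up daily minutes for two users over two weeks; illustrative only.
daily <- data.frame(
  Id2 = rep(c("A", "B"), each = 14),
  date = rep(seq(as.Date("2016-04-17"), by = "day", length.out = 14), times = 2),
  LightModMinutes = c(rep(30, 14), rep(10, 14)),
  VeryActiveMinutes = c(rep(12, 7), rep(5, 7), rep(0, 14))
)

# Label each day with the Sunday that starts its week ("%w": 0 = Sunday).
daily$week_start <- as.character(daily$date - as.integer(format(daily$date, "%w")))

# Total minutes per user and week.
weekly <- aggregate(cbind(LightModMinutes, VeryActiveMinutes) ~ Id2 + week_start,
                    data = daily, FUN = sum)

# Flag the weeks in which each goal was met (1 = met, 0 = not met).
weekly$met_lm <- as.integer(weekly$LightModMinutes >= 150)
weekly$met_v  <- as.integer(weekly$VeryActiveMinutes >= 75)

# Number of weeks each user met each goal.
weeks_met <- aggregate(cbind(met_lm, met_v) ~ Id2, data = weekly, FUN = sum)
weeks_met
```

In this toy data, user A meets the light-to-moderate goal in both weeks and the very active goal in one week, while user B meets neither goal.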

Hourly Activity Data

Prepare – Merge hourly activity datasets

1. Merging and filtering steps, calories, intensity [sci] data

Looking at the datasets beyond daily activity, I noticed I could combine hourly calories, hourly steps, and hourly intensity into one data frame. I then filtered the data to the hours when people are regularly moving and exercising, between 6 AM and 9 PM, and again to the three-week period from Sunday, April 17, 2016 to Saturday, May 07, 2016.

#View(h.calories)
#View(h.steps)
#View(h.intensities)

# count the number of hourly records per user Id
fn.countId <- function(dt) {
  dt %>% count(Id)
}

fn.countId(h.calories) %>% glimpse() # 33 users
Rows: 33
Columns: 2
$ Id <dbl> 1503960366, 1624580081, 1644430081, 1844505072, 1927972279, …
$ n  <int> 717, 736, 708, 731, 736, 736, 736, 735, 414, 736, 472, 696, …
# steps + calories
sc <- merge(h.steps, h.calories, by=c("Id","ActivityHour"))
#View(sc)
# steps + calories + intensities
sci <- merge(sc, h.intensities, by=c("Id","ActivityHour"))
sci <- merge(sci, Id.new, by="Id", all.x=TRUE) # add Id2
#View(sci)


# set timezone
Sys.setenv(TZ="America/New_York") # the environment variable name is upper-case TZ
Sys.timezone()
[1] "America/New_York"
# convert ActivityHour to datetime format
sci$datetime <-strptime(sci$ActivityHour, "%m/%d/%Y %I:%M:%S %p")

# extract hour of day from time, isolate hour
fn.extractHr <- function(dt){
  dt$date <- as.Date(dt$datetime, format="%m/%d/%Y")
  dt$time <- as.POSIXct(dt$datetime, format="%H:%M %p")
  dt$hour <- format(dt$time, format="%I %p")
  dt$hour_min <- format(dt$time, format="%I:%M %p")
  return(dt)
}


sci <- fn.extractHr(sci)
#View(sci)
sci$Day <- weekdays(sci$date)

# range of hours when most people are moving (not sleeping)
amHrs <- paste(6:9, "AM", sep=" ")
amHrs <- paste0(0, amHrs)
midHrs <- paste(10:11, "AM", sep=" ")
pmHrs <- paste(1:9, "PM", sep=" ")
pmHrs <- paste0(0, pmHrs)
movingHrs <- c(amHrs, midHrs, "12 PM", pmHrs)
#glimpse(movingHrs)

# filter sci data for the selected hours and for dates between April 17 and May 7, 2016
sci_fil <- sci %>% filter(hour %in% movingHrs) %>% filter(between(date, as.Date("2016-04-17"), as.Date("2016-05-07")))
#View(sci_fil)


# examine Id2 in sci_fil - Does it match users in userTypes? YES
Id.sci_fil <- sci_fil %>% group_by(Id2) %>% summarize(count = n()) # 32 users
#View(Id.sci_fil)
#View(userTypes)

testIdmatch <- merge(x=Id.sci_fil, y = userTypes, by="Id2", all.x = TRUE )
#View(testIdmatch)

# merge sci with userTypes
sci_fil <- merge(x=sci_fil, y=userTypes, by="Id2", all.x = TRUE )
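
The hour labels built with paste() above can also be generated from hour numbers. The sketch below is an alternative, not the original approach: hour_label() is a hypothetical helper that produces the same "06 AM" to "09 PM" strings without depending on how the system locale prints AM and PM.

```r
# Hypothetical helper: turn a 24-hour number (0-23) into a label like "06 AM".
hour_label <- function(h) {
  sprintf("%02d %s",
          ifelse(h %% 12 == 0, 12, h %% 12),  # 0 and 12 both display as 12
          ifelse(h < 12, "AM", "PM"))
}

# Labels for 6 AM through 9 PM.
moving_hrs <- hour_label(6:21)
moving_hrs

# Parse one timestamp in the file's format and label its hour.
# Note: parsing "%p" assumes a locale that recognises "AM"/"PM".
ts <- as.POSIXlt("4/12/2016 1:00:00 PM", format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
hour_label(ts$hour)
```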

Analyze – Group data by hour and day

2. What hours of each day are people exercising?

After prepping the combined hourly steps, calories, and intensity data [sci_fil], I analyzed activity trends by hour of day to see when users were exercising the most, computing the mean of each measurement for every hour and day of the week.

#View(sci_fil)

fn.hourDay.avgs <- function(dt) {
  dt <- dt %>% group_by(hour, Day) %>% 
    summarize(avgStepTotal = mean(StepTotal), avgCalories = mean(Calories),
            avgTotalIntensity = mean(TotalIntensity)) %>%
  mutate(across(where(is.numeric), round))
  return(dt)
}
  
sciAvgs.DayHour <- fn.hourDay.avgs(sci_fil)
`summarise()` has grouped output by 'hour'. You can override using the `.groups` argument.
#View(sciAvgs.DayHour)
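
The `summarise()` message above is informational: after summarizing, the result is still grouped by hour. If an ungrouped result is preferred, the .groups argument can be set explicitly. A minimal sketch on a made-up table:

```r
library(dplyr)

# Made-up hourly step totals; illustrative only.
toy <- tibble(
  hour = c("06 AM", "06 AM", "07 AM", "07 AM"),
  Day = c("Monday", "Monday", "Monday", "Tuesday"),
  StepTotal = c(100, 300, 50, 70)
)

# .groups = "drop" returns an ungrouped tibble and silences the message.
avgs <- toy %>%
  group_by(hour, Day) %>%
  summarize(avgStepTotal = mean(StepTotal), .groups = "drop")

avgs
```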

Share – Intensity heatmaps by hour and day of week

I plotted the averaged hourly activity data [sciAvgs.DayHour] as several heatmaps. When making the heatmaps, I referenced the heatmap tutorial on Yan Holtz’s R Graph Gallery, as well as the post on RColorBrewer’s palettes, which was helpful in selecting the colors.


# Overall average total intensities by hour and day of the week
heat.DayHour.intensity <- sciAvgs.DayHour %>% ggplot(aes(x = hour, y=Day, fill=avgTotalIntensity)) +
  geom_tile() +
  ggtitle("On Average, More Intense Activity on Wednesdays after 5 p.m., Saturday 1 p.m., Tuesdays at Noon") +
  labs(fill = "average intensity", subtitle="More intense workouts after 5 p.m. in earlier weekdays of the week (Mon.- Wed.)", caption = "Fitbit data for three weeks (Sun, April 17 to Sat, May 07, 2016) \nSource: Furberg et al. 2016. Zenodo.", y="day") +
  scale_x_discrete(limits= movingHrs) +
  scale_y_discrete(limits= rev(li.daysofwk)) +
  scale_fill_distiller(palette="Oranges", direction =1) +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title.position = "plot",
        plot.title= element_text(face="bold"))

heat.DayHour.intensity


# ggsave("./plots/heat.DayHour.intensity.png" , width = 10.5 , height = 7, dpi=300)

# Overall average total steps by hour and day of the week
heat.DayHour.steps <- sciAvgs.DayHour %>% ggplot(aes(x = hour, y=Day, fill=avgStepTotal)) +
  geom_tile() +
  ggtitle("On Average, More Steps on Wednesdays 5-6 p.m., Saturday 1-2 p.m., and Tuesdays at Noon") +
  labs(fill = "average steps", subtitle="More intense workouts after 5 p.m. on earlier weekdays of the week (Mon.-Wed.)", caption = "Fitbit data for three weeks (Sun, April 17 to Sat, May 07, 2016) \nSource: Furberg et al. 2016. Zenodo.", y="day") +
  scale_x_discrete(limits= movingHrs) +
  scale_y_discrete(limits= rev(li.daysofwk)) +
  scale_fill_distiller(palette="GnBu", direction =1) +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title.position="plot",
        plot.title= element_text(face="bold"))
heat.DayHour.steps


# ggsave("./plots/heat.DayHour.steps.png" , width = 10.5 , height = 7, dpi=300)

# separate data by userType 
avgsByHours_LM <- sci_fil %>% filter(user_type == "LM") %>% fn.hourDay.avgs()
`summarise()` has grouped output by 'hour'. You can override using the `.groups` argument.
#View(avgsByHours_LM)

avgsByHours_LMV <- sci_fil %>% filter(user_type == "LMV") %>% fn.hourDay.avgs()
`summarise()` has grouped output by 'hour'. You can override using the `.groups` argument.
#View(avgsByHours_LMV)


# for frequent light-moderate exercisers only - average total intensities by hour and day of week
heat.LM.intensity <- avgsByHours_LM %>% 
  ggplot(aes(x = hour, y = Day, fill=avgTotalIntensity)) +
  geom_tile() +
  ggtitle("On Average for Light/Moderate Exercisers: More Intense Activity on Weekends, and \nWeekdays After 6 p.m.") +
  labs(fill = "average intensity", subtitle="Monday, Tuesday, and Wednesday show more intense activity, of the weekdays", caption = "Fitbit data for three weeks (Sun, April 17 to Sat, May 07, 2016) \nSource: Furberg et al. 2016. Zenodo.", y="day") +
  scale_y_discrete(limits= rev(li.daysofwk)) +
  scale_x_discrete(limits=movingHrs) +
  scale_fill_distiller(palette="YlGn", direction =1) +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title.position = "plot",
        plot.title= element_text(face="bold"))
heat.LM.intensity


# ggsave("./plots/heat.LM.intensity.png", width = 10.5 , height = 7, dpi=300)

# frequent light-moderate-very active exercisers - average total intensities by hour and day of week
heat.LMV.intensity <- avgsByHours_LMV %>% 
  ggplot(aes(x = hour, y = Day, fill=avgTotalIntensity)) +
  geom_tile() +
  ggtitle("On Average for Very Active Exercisers: More Activity After 5 p.m., Monday through Wednesday") +
  labs(fill = "average intensity", subtitle="Monday, Tuesday, and Saturday hours around noon and Sunday nights also have more intense activity", caption = "Fitbit data for three weeks (Sun, April 17 to Sat, May 07, 2016) \nSource: Furberg et al. 2016. Zenodo.", y="day") +
  scale_y_discrete(limits= rev(li.daysofwk)) +
  scale_x_discrete(limits=movingHrs) +
  scale_fill_distiller(palette="OrRd", direction =1) +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title.position = "plot",
        plot.title = element_text(face="bold"))
heat.LMV.intensity


# ggsave("./plots/heat.LMV.intensity.png" , width = 10.5 , height = 7, dpi=300)

Key Findings – Hourly Activity

Based on the Fitbit hourly exercise heatmaps, I would suggest the following to improve Bellabeat Time:

  • Provide a questionnaire for users in the tracking app to ask what kind of exercise intensity they intend on doing on a weekly basis (very, moderate, or light activity) and suggest optimal exercise hours based on hourly trends of Fitbit users, grouped by exercise intensity type.

  • Weekdays after 5 p.m. and weekend afternoons are, on average, the more popular times for intense workouts. Giving users the option to create an hourly exercise schedule, including reminders to exercise at certain times of the day, could improve the consistency of workouts.

Sleep Tracking

Prepare – Clean sleep tracking data

1. Cleaning the sleep data

The original sleepDay dataset only has data for 24 users. This small sample size (fewer than 30 users) is a limitation of the data analysis.

First, I wanted to see how consistently each user logged their sleep minutes over a period of time. Again, I performed the usual preparation steps, assigning Id2 to each Id for easier readability and joining the daily activity data with the sleep data.

When cleaning the data, I found three users with duplicate data for the same dates. I removed these duplicate rows before plotting the data.


# merge sleepDay and d.activity
#View(sleepDay)
#View(d.activity)
sleepDay$date <- as.Date(sleepDay$SleepDay, format="%m/%d/%Y")
sleepDay <- merge(x=sleepDay, y=Id.new, by="Id", all.x = TRUE )
#glimpse(sleepDay) # 413 rows

sleepDay_sel <- sleepDay %>% select(TotalSleepRecords: Id2) 

sleepId <- sleepDay_sel %>% group_by(Id2) %>% summarize(n()) %>% glimpse()
Rows: 24
Columns: 2
$ Id2   <chr> "A", "C", "D", "E", "G", "H", "I", "L", "M", "O", "P", "Q…
$ `n()` <int> 25, 4, 3, 5, 28, 1, 15, 28, 8, 26, 24, 28, 5, 28, 31, 26,…
#View(sleepId)

sleepDates <- sleepDay_sel %>% group_by(date) %>% summarize(n()) %>% glimpse()
Rows: 31
Columns: 2
$ date  <date> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-15, 2016-04-…
$ `n()` <int> 13, 14, 13, 17, 14, 12, 10, 14, 15, 15, 13, 15, 13, 13, 1…
#View(sleepDates)

# check sleep data for count by date and user Id2
sleepTest <- sleepDay_sel  %>% group_by(date, Id2) %>% summarize(count = n()) %>% arrange(desc(count))
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
#View(sleepTest)

## Id2: P, S, ZD each have 2 entries for same date (duplicate data)
sleepTest2 <- sleepDay_sel %>% filter( Id2 == "ZD" & date == as.Date("2016-04-25") | Id2 == "P" & date == as.Date("2016-05-05") |
                                          Id2 == "S" & date== as.Date("2016-05-07"))
# View(sleepTest2) 

# remove duplicated rows in sleep data
sleepDay_sel <- sleepDay_sel %>% distinct() %>% glimpse() # 410 rows
Rows: 410
Columns: 5
$ TotalSleepRecords  <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361,…
$ TotalTimeInBed     <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384,…
$ date               <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04…
$ Id2                <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A",…
# merge sleep and activity data
sleepActivity <- merge(x=sleepDay_sel, y=d.activity, by=c("Id2", "date"), all.x=TRUE)

# add columns for total minutes of exercise
sleepActivity <- sleepActivity %>% mutate(LightModMinutes = (LightlyActiveMinutes + FairlyActiveMinutes)) %>% 
  mutate(TotalActiveMinutes = (VeryActiveMinutes + LightModMinutes))

#View(sleepActivity)
# add exercise intensity type of each user
sleepActivity.userTypes <- merge(x = sleepActivity, y = userTypes, by = "Id2", all.x=TRUE)

#View(sleepActivity.userTypes)
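
The duplicate check can be illustrated on a small made-up table. This base R sketch flags fully duplicated rows, removes them, and then confirms that each user/date pair appears only once. The values are illustrative, not taken from sleepDay.

```r
# Made-up sleep records; user P has the same row twice.
sleep <- data.frame(
  Id2 = c("A", "A", "P", "P", "S"),
  date = as.Date(c("2016-05-04", "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-07")),
  TotalMinutesAsleep = c(400, 420, 380, 380, 450)
)

# TRUE for every row that repeats an earlier row exactly.
dup_flag <- duplicated(sleep)
sleep[dup_flag, ]

# Keep only the first copy of each row.
sleep_clean <- sleep[!dup_flag, ]

# Each user/date pair should now occur once.
key_counts <- table(paste(sleep_clean$Id2, sleep_clean$date))
all(key_counts == 1)
```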

Daily Sleep Records

Analyze – Consistency of sleep records per user

Again, I filtered the daily sleep data to three weeks. I wanted to see how consistently users recorded their sleep time, so I counted the total number of date entries for each user.

# filter data for three week period
sleepActivity.userTypes_fil <- sleepActivity.userTypes %>% filter(between(date, as.Date("2016-04-17"), as.Date("2016-05-07")))

#View(sleepActivity.userTypes_fil)
# How consistent is the data?
## How many days each user logged sleep data?
sleepUsers <- sleepActivity.userTypes_fil %>% group_by(Id2) %>% summarize(num_days=n()) %>% glimpse() # 23 users
Rows: 23
Columns: 2
$ Id2      <chr> "A", "C", "D", "E", "G", "H", "I", "L", "M", "O", "P",…
$ num_days <int> 17, 3, 2, 2, 18, 1, 12, 21, 4, 18, 17, 18, 4, 18, 21, …
#View(sleepUsers)

## count how many users logged how many days
sleep.dateCount <- sleepUsers %>% ungroup() %>% group_by(num_days) %>% summarize(user_count=n())

userCount <- sleep.dateCount %>% summarize(totalUsers = sum(user_count))
#glimpse(userCount)

Share – Visualize user percentage by number of days

The bar graph shows the share of the 23 users who logged sleep data for each number of days in the selected three weeks. Only 4 users (about 17%) logged sleep minutes for all 21 days.

# plot to show percentage of users and record of sleep data over time

bar.sleep.dates <- sleep.dateCount %>% ggplot(aes(x=num_days, y = user_count/ sum(user_count))) + 
  geom_bar(stat="identity", fill = "slategray2") +
  geom_text(aes(label = user_count), vjust=-.25) +
  ggtitle("Fewer than 20% of users recorded sleep data daily for 3 weeks") +
  labs( x = "days", y = "users (23 total)", subtitle="Fitbit sleep tracking data for April 17 through May 7, 2016", caption="Bar labels are number of users \nSource: Furberg et al. 2016. Zenodo.") +
  scale_x_continuous(breaks=seq(1, 21, 2)) +
  scale_y_continuous(labels=scales::label_percent(accuracy=1)) +
  coord_cartesian(ylim=c(0, 1)) +
  theme_minimal() +
  theme(plot.title = element_text(face="bold"))

bar.sleep.dates


# ggsave("./plots/bar.sleep.dates.png" , width = 9 , height = 7, dpi=300)
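
The bar heights are simply user counts divided by the total number of users. A small base R sketch of that calculation follows. The num_days vector is made up, while the last line uses the 4-of-23 figure from the sleep data.

```r
# Made-up number of days logged by ten users; illustrative only.
num_days <- c(21, 21, 18, 18, 18, 12, 4, 3, 2, 1)

# Users per number of days, then each count as a share of all users.
counts <- table(num_days)
shares <- prop.table(counts)

# Share of users who logged all 21 days in the toy data.
share_all_21 <- as.numeric(shares["21"])
share_all_21

# The same arithmetic for 4 of 23 users, as a percentage.
pct_4_of_23 <- round(100 * 4 / 23, 1)
pct_4_of_23
```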

Bar graph references:

  • Thomas Neitmann’s blogpost about percentage scales was a helpful explainer on how to use the scales package. Note that percent() and percent_format() are retired, according to the scales package documentation, so I used label_percent() instead.

  • To adjust the y-axis scale limits to show 0 to 100%, I used coord_cartesian(). To read more about this, see roelpi’s blog post about adding percentage limits using ggplot2.

  • I referred to a Stack Overflow discussion for adding labels to bars in ggplot2, namely the response from user rcs to adjust the placement using vjust.

Key Findings – Sleep Records

The lack of consistent daily sleep logging among Fitbit users shows a problematic trend. Perhaps users are not wearing their tracking watches while they sleep.

  • I suggest that the Bellabeat team create a wearable sleep tracking ring, similar to those that some companies like Oura have designed. This could act as a companion to the Time watch, and the smaller device might encourage more people to wear a tracker when sleeping.

  • Providing daily reminders to wear the watch while sleeping, through the Bellabeat app, could help promote consistency of sleep tracking data collection.

Healthy Daily Sleep

Analyze – Target sleep day percentages VS. user percentages

2. How often do Fitbit users get enough daily sleep?

According to a 2015 study (Watson et al.) cited by the CDC, the healthy amount of sleep for an adult is 7 hours or more a night. Since the sleep data is in minutes, this converts to 420 or more minutes a night.

I categorized each recorded day as >= 420 min. or < 420 min. and then computed, for each user, the percentage of recorded days with a healthy amount of sleep. By grouping the data further, I found how many users shared each percentage of healthy sleep days.

Next, I grouped the percentage of healthy sleep data into percent ranges by quarters and recounted the total users per percent range and the respective user percentages. This data is easier to visualize than individual non-grouped healthy sleep day percentages.


# For each user, how often did they get a healthy amount of sleep? 
#View(sleepActivity.userTypes_fil)

healthy.sleepDays <- sleepActivity.userTypes_fil %>%
  mutate(sleepTarget = case_when(TotalMinutesAsleep >= 420 ~ "healthy",
                                 TotalMinutesAsleep < 420 ~ "no"))

## percentages by number of days they recorded sleep data
healthy.sleepDays.byUser <- healthy.sleepDays %>% group_by(Id2) %>% 
  summarize(healthy_sleep_days = sum(sleepTarget == "healthy"),
            total_days = n()) %>%
  mutate(pct_healthy_sleep = round(healthy_sleep_days / total_days, digits=2))

#View(healthy.sleepDays.byUser)

## group by percentages of healthy sleep days to find user counts and percentages
healthy.sleepDays.pctUsers <- healthy.sleepDays.byUser %>% group_by(pct_healthy_sleep) %>% 
  summarize(count_user = n()) %>% 
  mutate(pct_user = round( count_user / sum(count_user), digits=2)) %>% ungroup()
glimpse(healthy.sleepDays.pctUsers)
Rows: 16
Columns: 3
$ pct_healthy_sleep <dbl> 0.00, 0.12, 0.25, 0.28, 0.33, 0.45, 0.52, 0.5…
$ count_user        <int> 5, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2
$ pct_user          <dbl> 0.22, 0.04, 0.04, 0.04, 0.09, 0.04, 0.04, 0.0…
sum(healthy.sleepDays.pctUsers$count_user) #23 total users
[1] 23
ranges.healthySleepDays.pctUsers <- healthy.sleepDays.pctUsers %>% 
  mutate( ranges_sleepDays = dplyr::case_when( pct_healthy_sleep >= 0.75 ~ "75-100%",
                    pct_healthy_sleep >= 0.50 ~ "50-74%",
                     pct_healthy_sleep >= 0.25 ~ "25-49%",
                    pct_healthy_sleep > 0 ~ "0-24%"))

# replace NA with 0-24%
ranges.healthySleepDays.pctUsers[is.na(ranges.healthySleepDays.pctUsers)] = c("0-24%")       

ranges.healthySleepDays.pctUsers <- ranges.healthySleepDays.pctUsers %>% 
  select(!pct_user) %>% 
  group_by(ranges_sleepDays) %>% 
  summarize(count_user = sum(count_user)) %>% ungroup() %>% 
  mutate(pct_user = round( count_user / sum(count_user), digits=2)) %>% ungroup()
  
#View(ranges.healthySleepDays.pctUsers)
            

Share – Visualize target sleep day percentages by users

In the bar chart, fewer than 25% of Fitbit users (5 users) recorded a healthy sleep duration (at least 7 hours a day) on at least 75% of their recorded days, while just over 25% of users (6 users) got enough sleep on fewer than a quarter of their recorded days. This suggests that users could use more consistency in the amount of sleep they are getting.


bar.pctHealthySleepDays <- ranges.healthySleepDays.pctUsers %>% 
  ggplot(aes(x = ranges_sleepDays, y = pct_user)) +
  geom_bar(stat="identity", fill="lightskyblue") +
  geom_text(aes(label = count_user), vjust = 1.4) +
  ggtitle("Fewer than 1/4 Fitbit users had at least \n75% of healthy sleep days (7+ hours/day)") +
  labs(x = "days of healthy sleep", y="users (23 total)", subtitle="Over 1/4 users had fewer than 1/4 total healthy sleep days (April 17 to May 07, 2016)", caption = "Bar labels are number of users \nSource: Furberg et al. 2016. Zenodo.") +
  scale_y_continuous(labels=scales::label_percent(accuracy=1)) +
  scale_x_discrete() +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title= element_text(face="bold")) +
  coord_cartesian(ylim = c(0, 1))
bar.pctHealthySleepDays 


# ggsave("./plots/bar.pctHealthySleepDays.png" , width = 10, height = 7, dpi=300)

Key Findings – Suggestions for More Healthy Sleep Days

Nearly half of the Fitbit users had healthy sleep on fewer than 50% of the days they recorded, so another sleep data trend is that many users are not regularly sleeping the recommended 7+ hours per day.

I suggest that Bellabeat create a function in the app for a user to select target sleep and waking times that reflect a healthy amount of daily sleep. The addition of notification reminders for users to sleep at a certain time could help increase the number of users who are getting enough daily sleep.

Sleep VS. Exercise Activity

Analyze

If a person, on average, sleeps enough in a week, does that correlate with them spending more time exercising during that same week?

I wanted to examine:

  • Are sleep times and activity times correlated?
  • Are sleep times and calories correlated?

Taking the merged sleep and daily activity dataset, I grouped the data by user and week, and found the average values for total sleep for each week and the total weekly activity minutes, steps, calories, as well as the mean weekly calories.

# Is there a correlation of healthy daily sleep and exercise activity?
wk.sleepActivity <- sleepActivity.userTypes_fil %>% ungroup() %>% group_by(Id2, week_date) %>% 
  summarize(avgSleepMin = mean(TotalMinutesAsleep),
            medianSleepMin = median(TotalMinutesAsleep),
            wkActiveMin = sum(TotalActiveMinutes),
            wkLightModMin = sum(LightModMinutes),
            wkVeryActiveMin = sum(VeryActiveMinutes),
            totalSteps = sum(TotalSteps),
            totalCalories = sum(Calories),
            avgCalories = mean(Calories))
`summarise()` has grouped output by 'Id2'. You can override using the `.groups` argument.
wk.sleepActivity <- merge(x = wk.sleepActivity, y = userTypes, by="Id2", all.x= TRUE)

# View(wk.sleepActivity)

Share

For the first scatter plot showing average sleep time and total weekly activity time, I colored the points by type of exerciser, using the user intensity types from the earlier activity analysis. The size of the points reflects the total weekly calories. Both groups of exercisers show a slightly positive correlation of total weekly exercise time and weekly average sleep time.

In the second scatter plot, total weekly calories is also slightly positively correlated with weekly average sleep time, though there are outlier cases.

# weekly active minutes vs. average sleep minutes
scatter.wk.sleepActiveMin <- wk.sleepActivity %>% 
  ggplot(aes(x = avgSleepMin, y = wkActiveMin)) +
  geom_point(aes(color = user_type, size=totalCalories)) +
  geom_smooth(method="lm", se = FALSE, alpha = 0.2, aes(color = user_type)) +
  geom_vline(xintercept = 420, linetype = "dashed", color = "red") +
  ggtitle("Weekly Average Sleep and Activity Times Are Positively Correlated") +
  scale_y_continuous(breaks = seq(0, 28000, 4000)) +
  scale_color_manual(values = c("palegreen3", "sienna2", "lightgoldenrod2"), labels= c("low or moderate", "low, moderate, or very", "not active")) +
  labs(subtitle = "Weekly Fitbit data for April 17 to May 7, 2016", x = "average sleep minutes", y = "total weekly active minutes", size = "total week calories", color = "type of exerciser", caption = "Dashed line marks healthy sleep per day (420 min) \nSource: Furberg et al. 2016. Zenodo.") +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title= element_text(face="bold"))
scatter.wk.sleepActiveMin
`geom_smooth()` using formula 'y ~ x'

# ggsave("./plots/scatter.wk.sleepActiveMin.png" , width = 10, height = 7, dpi=300)

# total weekly calories vs. average sleep minutes
scatter.wk.sleepCalories <- wk.sleepActivity %>% 
  ggplot(aes(x = avgSleepMin, y = totalCalories)) +
  geom_point() +
  geom_smooth(method="lm", se = FALSE, color = "slategray2") +
  geom_vline(xintercept = 420, linetype = "dashed", color = "red") +
  ggtitle("Weekly Average Sleep Time and Total Calories Are Positively Correlated") +
  scale_y_continuous(breaks = seq(0, 28000, 4000)) +
  labs(subtitle = "Weekly Fitbit data for April 17 to May 7, 2016", x = "average sleep minutes", y = "total weekly calories", caption = "Dashed line marks healthy sleep per day (420 min) \nSource: Furberg et al. 2016. Zenodo.") +
  theme_minimal() +
  theme(plot.caption.position="plot",
        plot.title= element_text(face="bold"))
scatter.wk.sleepCalories
`geom_smooth()` using formula 'y ~ x'

# ggsave("./plots/scatter.wk.sleepCalories.png" , width = 10, height = 7, dpi=300)

Key Findings – Sleep and exercise activity

More sleep is positively correlated with more exercise and calorie burning, by week, although the correlation is slight and does not establish causation. If Bellabeat Time users are provided daily reminders to meet sleep targets (7+ hours / day), that could, in turn, support their exercise activity.

Summary of Project Analysis

For this case study, I analyzed Fitbit data related to activity and sleep to examine trends among fitness tracking users, to then inform insights for improving the Bellabeat Time fitness watch, which is designed to track user activity and sleep patterns, among other measures.

Activity

I analyzed daily activity data related to activity intensities (very, moderate, and lightly active), calories, and steps. By determining whether or not users achieved healthy light/moderate or very active times per week and how frequently they met those goals (a majority of weeks being at least 2 of 3 weeks), I found two main groups of exercisers:

  • Light/moderate exercisers (but not very active) – LM
  • Light/moderate and very active exercisers – LMV

Then, I looked at hourly activity data, using this same grouping of exerciser types. I examined when people most frequently exercised, based on average total intensities by hour and day of the week and determined separate trends for different types of exercisers, along with trends for the overall users.

Activity Trends

  • Wider distribution of amount of activity on weekends than weekdays
  • Of the weekdays, slightly greater median of activity intensity on Tuesdays
  • More weekly calories burned in less time and shorter distance for users who frequently are very active, as compared to exercisers who are only frequently light/moderately active
  • Overall, more intense workouts occur on Tuesdays at noon, Saturdays at 1 p.m., and after 5 p.m. on weekdays
  • Frequent light/moderate exercisers tend to be more active on weekends, and weekdays after 6 p.m.
  • Frequent very active exercisers tend to be more active on Monday, Tuesday, and Saturday around noon; and Sunday through Wednesday after 5 p.m.

Sleep

For the sleep day data, I analyzed the following and found these trends:

  • How often were users recording sleep data?
    • Fewer than 20% of users consistently recorded sleep data over three weeks
  • How often were users getting a healthy amount of sleep each day (at least 7 hours), of the total days they recorded sleep data?
    • Fewer than a quarter of users had healthy sleep on at least 75% of their recorded days
    • Over a quarter of users had healthy sleep on fewer than 25% of their recorded days
  • By merging the sleep and activity data and visualizing total weekly active minutes and average sleep minutes per week, there is a slightly positive correlation between sleep and activity for exercisers of either intensity group (LMV or LM).

Data Limitations

The data used for this analysis was limited to a small sample size (around 30 or fewer users).

In particular, the sleep dataset had information for fewer than 30 users. Because of the small sample and missing data, the sleep trends found in this analysis cannot be assumed to represent a larger population.

As the data collected was for a short time period, between April and May 2016, the health activity and sleep patterns are limited to an optimal season of exercise (spring/fall, depending on the hemisphere location of each user). Activity and sleep trends could vary during other seasons of the year.

Due to the anonymity of user Ids, no information was known about users’ biological sex or age, nor work hours, living conditions, and class status, characteristics which could impact a person’s exercise and sleep schedules.

Act – High-level Recommendations for Stakeholders

The following recommendations are based on trends from the analysis and can be applied to Bellabeat’s fitness tracking watch, Time.

Activity tracking suggestions

  • Provide a questionnaire for users to determine the type of intensity workouts they plan on doing regularly and an option to set weekly exercise goals
  • Provide a weekly summary graph to show times of activity intensity per day and weekly totals per type of intensity (light/moderate and very active)
  • Include a reminder of optimal intensity exercise goals during the week and the difference of time needed to complete that weekly goal, based on exercise intensity time already spent
  • Give users the option to create hourly exercise schedules with reminder notifications
  • Suggest optimal exercise hours based on the type of exercise they intend on doing, as answered in their questionnaire

Sleep tracking suggestions

  • Create an optional sleep tracking ring to act as a companion to the Time watch. The ring is a smaller wearable option that could encourage users to wear a tracking device daily when sleeping.
  • Provide the option of daily notifications to remind users to wear the watch while sleeping
  • Allow users to select optimal sleep and waking times and set reminders to go to bed by that certain time
  • Inform users of the benefits of healthy amounts of sleep, correlated with more weekly exercise

References

The following references are organized into sections by order of appearance.

Data, Background Information

  1. Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016. [Data set]. Zenodo. https://doi.org/10.5281/zenodo.53894. Retrieved from https://www.kaggle.com/arashnic/fitbit.
  2. Fitabase. (2018). Fitabase Data Dictionary. Retrieved from https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf
  3. National Health Interview Survey (NHIS), Centers for Disease Control and Prevention (CDC) / National Center for Health Statistics (NCHS). (2022). Physical Activity. Retrieved from https://www.healthypeople.gov/2020/topics-objectives/topic/physical-activity/national-snapshot.
  4. CDC. (2022) Physical Activity. Retrieved from https://www.cdc.gov/physicalactivity/basics/adults/index.htm.
  5. Watson NF, Badr MS, Belenky G, et al. Recommended amount of sleep for a healthy adult: a joint consensus statement of the American Academy of Sleep Medicine and Sleep Research Society. Sleep. 2015;38(6):843–844. http://dx.doi.org/10.5665/sleep.4716.

Coding Resources

  1. Soni, Yash. (2018) How I analyzed the data from my Fitbit to improve my overall health. freeCodeCamp. https://www.freecodecamp.org/news/how-i-analyzed-the-data-from-my-fitbit-to-improve-my-overall-health-a2e36426d8f9/.
  2. Kassambara, Alboukadel. ggcorrplot: Visualization of a correlation matrix using ggplot2 (0.1.3). https://rpkgs.datanovia.com/ggcorrplot/.
  3. Holtz, Yan. (2018) Violin plot with included boxplot and sample size in ggplot2. The R Graph Gallery. https://www.r-graph-gallery.com/violin_and_boxplot_ggplot2.html.
  4. Kamvar, Zhian N. (2021) Package aweek: Convert Dates to Arbitrary Week Definitions (1.02). Retrieved from https://cran.r-project.org/web/packages/aweek/aweek.pdf.
  5. Li, Deanna. (2020) Basic R Guide for NSC Statistics, Chapter 19: Scatterplots and Best Fit Lines - Two Sets. (Bookdown) https://bookdown.org/dli/rguide/scatterplots-and-best-fit-lines-two-sets.html.
  6. Holtz, Yan. (2018). ggplot2 heatmap. The R Graph Gallery. https://www.r-graph-gallery.com/79-levelplot-with-ggplot2.html.
  7. Holtz, Yan. (2018). R Color Brewer’s Palettes. https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html.
  8. Neitmann, Thomas. (2020) Transform a ggplot2 Axis to a Percentage Scale. https://thomasadventure.blog/posts/ggplot2-percentage-scale/.
  9. Wickham, Hadley and Seidel, Dana. (2020) Package scales: Scale Functions for Visualization (1.1.1). https://cran.r-project.org/web/packages/scales/scales.pdf.
  10. roelpeters.be. (2019) Add percentages to your axes in R’s ggplot2 (and set the limits). https://www.roelpeters.be/add-percentages-to-your-axes-in-rs-ggplot2-and-set-the-limits/.
  11. rcs. (2012) How to put labels over geom_bar for each bar in R with ggplot2. Stack Overflow. https://stackoverflow.com/questions/12018499/how-to-put-labels-over-geom-bar-for-each-bar-in-r-with-ggplot2.
---
title: "Case Study - Fitness and Health Tracking Analysis"
output:
  html_document:
    df_print: paged
  html_notebook: default
author: "Julia Hong"
date: March 29, 2022
---

### Intro

#### Project overview
In this case study, I am a junior data analyst who is working for the marketing analytics team at Bellabeat, a high-tech company that designs health tracking products for women. This hypothetical scenario is provided by Google's Data Analytics Certificate Program through Coursera, and I will be outlining the standard data analysis pathway throughout this project (ask, prepare, process, analyze, share, and act).


### Ask

#### Business task
My main objective for this project is the following:
Based on trends in non-Bellabeat smart devices for tracking health data, what trends occur among consumers and how can they be applied to a Bellabeat device and the company's marketing strategy?

#### Stakeholders
After my analysis, I would share my high-level findings and provide suggestions to improve a Bellabeat device to the main stakeholder, Urška Sršen, one of the two founders of Bellabeat. I would also share my findings and more detailed analyses and code with the data analytics team prior to sharing high-level results with the founder.

### Prepare

#### Non-Bellabeat Data
The data for non-Bellabeat devices was downloaded from Kaggle user Möbius: [Fitbit Fitness Tracker Data](https://www.kaggle.com/arashnic/fitbit). This data collection contains 18 CSV files about exercise activity, intensity, calories, steps, METS (metabolic equivalent of task), heartrate, weight, and sleep across different time measurements (days, hours, minutes, seconds).

The original datasets were created by Furberg, R. et al, published on [Zenodo](https://zenodo.org/record/53894#.YjQHXi-B2lK) in May 2016. It is open-sourced with a [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode).
The data is based on a 2016 crowdsourced survey of Fitbit users, conducted via Amazon Mechanical Turk.


```{r tidyverse package}
library(tidyverse)

```


#### Load datasets

When loading data, I wanted to name the dataframes with meaningful names. To simplify the file names, I used the following key for time measurements:

* daily - d
* hourly - h
* minutes - m
* seconds - s
* Narr - Narrow
* Wide - Wide

Dataframe names begin with a time measurement, followed by a period for readability, and a description of the health measurement.

```{r data sets}
# I commented out dataframes that won't be used in this analysis.
# To view dataframes, uncomment each View(dataframe_name) and re-run code

# activity
d.activity <- read_csv("./data/dailyActivity_merged.csv", show_col_types = FALSE) 
# View(d.activity)
# d.activity contains the data found separately in d.intensities, d.calories, d.steps

# intensities
# d.intensities <- read_csv("./data/dailyIntensities_merged.csv", show_col_types = FALSE)
# View(d.intensities)
h.intensities <- read_csv("./data/hourlyIntensities_merged.csv", show_col_types = FALSE)
#View(h.intensities)
# m.intensitiesNarr <- read_csv("./data/minuteIntensitiesNarrow_merged.csv", show_col_types = FALSE)
#View(m.intensitiesNarr)
# m.intensitiesWide <- read_csv("./data/minuteIntensitiesWide_merged.csv", show_col_types = FALSE)


# calories
# d.calories <- read_csv("./data/dailyCalories_merged.csv", show_col_types = FALSE)
# View(d.calories)
h.calories <- read_csv("./data/hourlyCalories_merged.csv", show_col_types = FALSE)
# View(h.calories)
# m.caloriesNarr <- read_csv("./data/minuteCaloriesNarrow_merged.csv", show_col_types = FALSE)
# m.caloriesWide <- read_csv("./data/minuteCaloriesWide_merged.csv", show_col_types = FALSE)
# View(m.caloriesWide)

# steps
# d.steps <- read_csv("./data/dailySteps_merged.csv", show_col_types = FALSE)
# View(d.steps)
h.steps <- read_csv("./data/hourlySteps_merged.csv", show_col_types = FALSE)
# View(h.steps)
# m.stepsNarr <- read_csv("./data/minuteStepsNarrow_merged.csv", show_col_types = FALSE)
# View(m.stepsNarr)
# m.stepsWide <- read_csv("./data/minuteStepsWide_merged.csv", show_col_types = FALSE)
# View(m.stepsWide)

# METs - metabolic equivalent of task (relates to intensity)
# minuteMETS <- read_csv("./data/minuteMETsNarrow_merged.csv", show_col_types = FALSE)
# View(minuteMETS)

# heartrate
# s.heartrate <- read_csv("./data/heartrate_seconds_merged.csv", show_col_types = FALSE)

# weight
# weightLog <- read_csv("./data/weightLogInfo_merged.csv", show_col_types = FALSE)
# View(weightLog)

# sleep
sleepDay <- read_csv("./data/sleepDay_merged.csv", show_col_types = FALSE)
# View(sleepDay)


```


After loading the data in R, I noticed that some column values, like distance or METs, were vague in meaning. For instance, I wondered, what measurement is the Fitbit tracker using for total distance? I found the [Fitabase Data Dictionary](https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf), the metadata for these files, on the Fitabase website. This metadata explains what each column value means. Turns out, the total distance is measured in kilometers.


#### Bellabeat Device -- Time

Although the data collection contains many files, I will only be analyzing specific health measurements for the scope of this project -- in particular, those that relate to the Bellabeat device of choice. After reviewing Bellabeat's products, I chose to focus on [Time](https://bellabeat.com/time/), a health and wellness tracking watch for:

* Active time
* Sleep time
* Mindfulness
* Stress resistance
* Menstrual cycle

Additional benefits of the watch include wireless connectivity and syncing to the Bellabeat app, meditation and hydration tracking, and a wellness score calculation, according to the website.

I wanted to examine trends for Fitbit data on **activity** and **sleep** because those health measures are relevant to the Time watch. Of the datasets, I incorporated the following in my analysis:

* d.activity
* h.intensities
* h.calories
* h.steps
* sleepDay

These datasets are in wide format and saved as CSVs in the [data] folder within the overall project folder.

For inspiration on ways to analyze and visualize Fitbit tracking data, I referred to Yash Soni's tutorial, published on [freeCodeCamp](https://www.freecodecamp.org/news/how-i-analyzed-the-data-from-my-fitbit-to-improve-my-overall-health-a2e36426d8f9/). The main difference with Soni's data is that it is specific to one individual, whereas the sample Fitbit data contains activity data for up to 33 different users, each assigned with a unique Id. But based on the tutorial, ways I can analyze data include by day of the week, using boxplots or bar graphs, or using a correlation matrix.

In addition to tidyverse, I used the following R packages:
```{r packages}

library(lubridate) # wrangling dates
library(aweek)# adding week numbers to dates
library(janitor) # cleaning dataframe column names
library(ggcorrplot) # correlation matrix
library(scales) # visualizing percentage scales

```

### Daily Activity
#### Process -- Cleaning daily activity data

**1. Adding day of week and Id2 to d.activity**

Since I planned to examine trends by day of the week, I used the weekdays() function in base R to convert each date into the day of the week (e.g., Monday). For easier readability of Id numbers, I assigned each user Id a letter or a letter pair, stored as Id2.

```{r weekday vs weekend exercise}
d.activity$date <- as.Date(d.activity$ActivityDate, format="%m/%d/%Y")
#View(d.activity)

# extract week day
d.activity$Day <- weekdays(d.activity$date)

li.wkdays <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
li.wkends <- c("Sunday", "Saturday")
li.daysofwk <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

d.activity <- d.activity %>% mutate(wkdayend = case_when(Day %in% li.wkdays ~ "weekday", Day %in% li.wkends ~ "weekend"))
d.activity$Id <- as.character(d.activity$Id)
#View(d.activity)

# Pair each Id with a unique letter Id2. LETTERS is "A":"Z". As there are 33 users and only 26 alphabet letters, add 7 more unique letter pair as Id2.
Id2 <- append(LETTERS, c("ZA", "ZB", "ZC", "ZD", "ZE", "ZF", "ZG")) # new Ids
Id.unique <- d.activity %>% select(Id) %>% distinct() # distinct Ids
Id.new <- data.frame(Id.unique, Id2)
#View(Id.new)

# left join to d.activity
d.activity <- merge(x=d.activity, y=Id.new, by="Id", all.x=TRUE)
#View(d.activity)

```
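
The pairing above hard-codes seven extra labels for 33 users. As a minimal sketch of a more general labelling, the snippet below uses a hypothetical helper, `make_letter_ids()`, and made-up Ids; neither is part of the original analysis, and the helper only covers up to 52 users.

```r
# Hypothetical helper (an assumption for illustration, not from the original code):
# build n unique labels: "A".."Z", then "ZA", "ZB", ...
make_letter_ids <- function(n) {
  stopifnot(n >= 1, n <= 52)  # this sketch only covers up to 52 users
  extra <- if (n > 26) paste0("Z", LETTERS[seq_len(n - 26)]) else character(0)
  c(LETTERS[seq_len(min(n, 26))], extra)
}

# Made-up Ids standing in for the Fitbit user Ids
ids <- c("1001", "1002", "1002", "1003")
id_lookup <- data.frame(Id = unique(ids),
                        Id2 = make_letter_ids(length(unique(ids))),
                        stringsAsFactors = FALSE)
id_lookup
```

Because the labels are assigned in the order the distinct Ids appear, the lookup table should be built once and then joined back to the data, as in the chunk above.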



**2. A visual overview of d.activity**

I plotted a correlation matrix to visualize potential correlation relationships between data value columns. I referenced Alboukadel Kassambara's documentation of the [ggcorrplot package](https://rpkgs.datanovia.com/ggcorrplot/), which is also on [GitHub](https://github.com/kassambara/ggcorrplot).


Sedentary minutes and distance have little to no correlation with calories, as expected. For each active intensity (very, moderate/fair, light), the minutes and distance have a positive correlation, as do the total distance and tracker distance. When analyzing the data, I will focus on the active exercise and filter out the sedentary minutes and distance.
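
To show how a single correlation from the matrix could be checked, here is a minimal base R sketch on made-up values. The `toy` data frame and its column names are assumptions for illustration only and are not taken from the Fitbit files.

```r
# Made-up daily values: active minutes, distance, and sedentary minutes
toy <- data.frame(
  active_min    = c(10, 25, 40, 55, 70, 85),
  distance_km   = c(1.0, 2.4, 3.9, 5.6, 7.1, 8.3),
  sedentary_min = c(700, 650, 720, 690, 710, 660)
)

# Correlation matrix rounded to one decimal, as in the analysis
corr_mat <- round(cor(toy), 1)

# p-value for one pair of columns, using a Pearson correlation test
p_active_distance <- cor.test(toy$active_min, toy$distance_km)$p.value

corr_mat
p_active_distance
```

A small p-value only indicates that the linear association is unlikely to be zero in this sample; it says nothing about the direction of cause.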

Next, I wanted to know, how consistently are users recording their daily activity and using the Fitbit tracker? The number of entries varied between users, which can be seen when plotting the date and TotalDistance for each user in the scatter plot. I isolated the top 30 users with the most data.

```{r overview activity, fig.width= 9, fig.height=5}
# corr matrix for d.activity using the ggcorrplot library
d.activity_sel <- d.activity %>% select(TotalDistance: TrackerDistance, VeryActiveDistance: Calories)

mtrx.activity <- round(cor(d.activity_sel), 1) # correlation matrix
pvals.activity <- cor_pmat(d.activity_sel) # p-values matrix

corr.activity <- ggcorrplot(mtrx.activity, outline.color = "gray28", type="lower", legend.title = "correlation") +
  ggtitle("Correlation of Daily Steps, Minutes, Distance, and Calories") +
  labs(subtitle = "Fitbit daily activity data for 33 users from April 12 to May 12, 2016", caption="Source: Furberg et al. 2016. Zenodo.") +
    theme(plot.title = element_text(face="bold"),
          plot.title.position= "plot",
          plot.caption.position = "plot")
corr.activity

# ggsave("./plots/corr.activity.png" , width = 8, height = 5, dpi=200)

 
# How consistent is the data for users?
userActivityCount <- d.activity %>% group_by(Id2) %>% 
  summarize(Count = n()) %>% arrange(desc(Count))
#View(userActivityCount)

# Plotting the date and TotalDistance for each user shows some users stopped data collection before May 12, 2016, the last day.
scatter.dateDistance <- d.activity %>% ggplot(aes(x = date, y=TotalDistance)) +
  geom_point(aes(color = Calories), stat="identity") +
  ggtitle("Some Fitbit users did not record distance for the full time period") +
  labs(subtitle = "Activity data was recorded Tuesday, April 12, 2016 to Thursday, May 12, 2016", caption="Source: Furberg et al. 2016. Zenodo.") +
  scale_x_date(date_breaks="2 weeks", date_labels="%b %d") +
  guides(color = guide_legend(reverse=TRUE)) +
  facet_wrap(~Id2, nrow=5) +
  theme_minimal() +
  theme(axis.text.x= element_text(hjust=0.5, vjust=0.3),
        axis.title.x = element_text(vjust = -0.5),
        plot.title = element_text(face="bold"), 
        plot.caption.position= "plot")
 
scatter.dateDistance
# ggsave("./plots/scatter.dateDistance.png" , width = 9, height = 5, dpi=300)


# Determine top 30 users with most daily data available
userActivity30 <- userActivityCount %>% top_n(30, Count) %>% glimpse()

#View(userActivity30)
```
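
One caveat with `top_n()` is that it keeps ties, so it can return more rows than requested. Below is a minimal base R sketch of a top-N selection with an explicit tie-break, on made-up data; the `toy_log` data frame and the alphabetical tie-break rule are illustrative assumptions.

```r
# Made-up daily log: one row per user per recorded day
toy_log <- data.frame(Id2 = c("A", "A", "A", "B", "B", "C", "C", "C", "D"),
                      stringsAsFactors = FALSE)

# Count records per user
counts <- as.data.frame(table(Id2 = toy_log$Id2),
                        responseName = "Count",
                        stringsAsFactors = FALSE)

# Sort by count (descending), breaking ties alphabetically by Id2
counts <- counts[order(-counts$Count, counts$Id2), ]

# Keep exactly the first 2 users
top_users <- head(counts, 2)
top_users
```

Whether ties should be kept or broken is a judgment call; the point is to make the rule explicit so the number of selected users is known.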


#### Analyze -- Find user trends in daily activity

**1. For the top 30 users, I examined when they most frequently exercise. How does their activity differ between days of the week?**

For examples of making a violin plot with a boxplot superimposed on it, I referred to Yan Holtz's The R Graph Gallery: [Violin plot with included boxplot and sample size in ggplot2](https://www.r-graph-gallery.com/violin_and_boxplot_ggplot2.html).

The original data for [d.activity] was from Tuesday, April 12, 2016 to Thursday, May 12, 2016. To examine trends by day of week, I selected 2 full weeks of data where 30 users have logged data. This way, there is an equal amount of data from each user for each day.
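
The choice of two full Sunday-to-Saturday weeks can be checked with base R. This is a minimal sketch that uses `format(dates, "%w")` for a locale-independent weekday index; the variable names are illustrative and not part of the original analysis.

```r
# The two-week window used for the day-of-week comparison
dates <- seq(as.Date("2016-04-17"), as.Date("2016-04-30"), by = "day")

# Weekday index: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
weekday_index <- as.integer(format(dates, "%w"))

# Each weekday should appear the same number of times in the window
days_per_weekday <- table(weekday_index)
days_per_weekday
```

If any weekday appeared more often than the others, its distribution in the violin plots would be built from more observations per user than the rest.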


#### Share -- Daily activity by day of the week

The violin plots show a broader distribution in steps and distance on the weekends than weekdays, and Tuesdays have a greater median of steps and total distance.

```{r top30 activity by days, fig.width= 8, fig.height=5}

top30.activity <- d.activity %>% filter(Id2 %in% userActivity30$Id2)
#View(top30.activity)

count_dates <- top30.activity %>% group_by(date) %>% count() %>% glimpse()

# isolate first two full weeks (starting from Sunday)
top30.activity.fullwks <- top30.activity %>% filter(between(date, as.Date("2016-04-17"), as.Date("2016-04-30")))

check1 <- top30.activity.fullwks %>% group_by(Id2) %>% count()
#View(check1)

#View(top30.activity.fullwks)
# steps by day
vio.stepsDay <- top30.activity.fullwks %>% ggplot(aes(x=Day, y=TotalSteps)) +
  geom_violin(aes(fill= wkdayend), alpha=0.5, width=0.8) +
  geom_boxplot(width =0.3, aes(fill=wkdayend), alpha=0.4) +
  ggtitle("Wider distribution of total steps on weekends than weekdays ", subtitle="Greater median steps on Tuesdays, of the weekdays") +
  scale_x_discrete(limits = li.daysofwk) +
  scale_y_continuous(breaks = seq(0, 28000, 7000)) +
  labs(x ="day", y = "steps", fill="day type", caption = "Fitbit data of 30 users for two weeks (Sun, April 17 to Sat, April 30, 2016) \nSource: Furberg et al. 2016. Zenodo.") +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title= element_text(face="bold"))
vio.stepsDay

# ggsave("./plots/vio.stepsDay.png" , width = 8, height = 5, dpi=300)


# distance by day
vio.distanceDay <- top30.activity.fullwks %>% ggplot(aes(x=Day, y=TotalDistance)) +
  geom_violin(aes(fill= wkdayend), alpha=0.5, width=0.8) +
  geom_boxplot(width =0.3, aes(fill=wkdayend), alpha=0.4) +
  ggtitle("Wider distribution of total distance on weekends than weekdays", subtitle="Of the weekdays, Tuesdays have slightly greater median distance") +
  scale_x_discrete(limits = li.daysofwk) +
  scale_y_continuous(breaks = seq(0, 28, 4)) +
  labs(x = "day", y="total distance, km", fill="day type", caption = "Fitbit data of 30 users for two weeks (Sun, April 17 to Sat, April 30, 2016) \nSource: Furberg et al. 2016. Zenodo.") +
  theme_minimal() +
  theme(plot.caption.position="plot",
        plot.title= element_text(face="bold"))
vio.distanceDay

# ggsave("./plots/vio.distanceDay.png" , width = 8, height = 5, dpi=300)

```


#### Analyze -- Trends by Intensity of Activity by Week

**2. How consistently did users meet weekly target exercise intensity minutes?**

Based on the [National Health Interview Survey](https://www.healthypeople.gov/2020/topics-objectives/topic/physical-activity/national-snapshot) and [CDC: Physical Activity](https://www.cdc.gov/physicalactivity/basics/adults/index.htm), healthy target minutes of weekly exercise by intensity for adults are:

  * 150+ minutes/week - light or moderate activity
  * 75+ minutes/week - vigorous activity (or, very active)
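
Both targets are inclusive ("150+" and "75+"), so a week with exactly 75 very active minutes should count as meeting the target. A minimal sketch of that comparison, using a hypothetical helper `meets_target()` and made-up weekly totals (neither is part of the original analysis):

```r
# Hypothetical helper: "yes" when weekly minutes reach the target, inclusive
meets_target <- function(minutes, target) {
  ifelse(minutes >= target, "yes", "no")
}

# Made-up weekly totals, including the boundary values 75 and 150
toy_weeks <- data.frame(
  VeryActiveMinutes    = c(0, 74, 75, 120),
  LightModerateMinutes = c(149, 150, 300, 90)
)

toy_weeks$VeryMin_target     <- meets_target(toy_weeks$VeryActiveMinutes, 75)
toy_weeks$LightModMin_target <- meets_target(toy_weeks$LightModerateMinutes, 150)
toy_weeks
```

Using one helper for both intensities keeps the boundary rule identical for the two targets.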

I analyzed minutes per week by intensity by first converting the dates to week number and week dates. Then, I filtered daily activity data for a three-week period between Sunday, April 17, 2016 and Saturday, May 07, 2016.

I grouped the data by user Id2 and week and calculated the sums of distances, calories, and minutes. To match the target rates, I combined light and moderate distances into a separate column and did the same with minutes.

To group dates into week numbers and week dates, I referenced the [aweek package](https://cran.r-project.org/web/packages/aweek/aweek.pdf). I converted each week number into the week date of the Sunday of that week, marking the week's starting date.
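
As a cross-check on the week dates, the Sunday that starts each week can also be computed in base R without aweek. This is a minimal sketch; `week_start_sunday()` is a hypothetical helper, not a function from the aweek package or the original analysis.

```r
# Hypothetical helper: return the Sunday on or before each date
week_start_sunday <- function(d) {
  d <- as.Date(d)
  # "%w" gives the weekday as a number, with 0 = Sunday
  d - as.integer(format(d, "%w"))
}

# A Sunday, a Wednesday, a Saturday, and the following Sunday
example_dates <- c("2016-04-17", "2016-04-20", "2016-04-23", "2016-04-24")
week_starts <- week_start_sunday(example_dates)
week_starts
```

The first three dates fall in the week starting April 17, 2016, while the last one starts a new week on April 24, 2016.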


```{r distance & minutes by activity type }

#View(d.activity)

# Convert date to week number
ow <- set_week_start("Sunday") # set start day of week

# function for extracting week number from date and converting to week date
fn.wkNumber <- function(dt) {
  dt$week <- date2week(dt$date)        # aweek label of the form "YYYY-Www-d"
  dt$week2 <- sub(".$", "1", dt$week)  # set the day digit to 1, the first day of the week
  dt$week_date <- week2date(dt$week2)  # convert week to Sunday date
  return(dt)
}

# apply week number / week date function to d.activity
d.activity <- fn.wkNumber(d.activity)

# filter data for three week period
d.activity_fil <- d.activity %>% filter(between(date, as.Date("2016-04-17"), as.Date("2016-05-07")))
#View(d.activity_fil)

d.activity_fil %>% group_by(week_date) %>% 
  summarize(n()) %>% glimpse()  # 3 weeks

# Sum distances, minutes, calories by week and Id2. Combine Light and Moderate into single column (FairlyActiveMinutes is equivalent to Moderate).
wkTotals.activity <- d.activity_fil %>%
  group_by(week_date, Id2) %>% 
  summarize(wkDistance = sum(TotalDistance), wkTrackerDistance = sum(TrackerDistance), wkSteps = sum(TotalSteps),
            wkCalories = sum(Calories), VeryActiveDistance = sum(VeryActiveDistance), LightModerateDistance = sum(LightActiveDistance) + sum(ModeratelyActiveDistance), 
            VeryActiveMinutes = sum(VeryActiveMinutes),
            LightModerateMinutes = sum(LightlyActiveMinutes) + sum(FairlyActiveMinutes))

# determine if week minute totals meet healthy exercise targets
wkTotals.activity <- wkTotals.activity %>% 
  mutate(VeryMin_target = case_when(VeryActiveMinutes < 75 ~ "no", VeryActiveMinutes >= 75 ~ "yes"),  # exactly 75 minutes meets the 75+ target
         LightModMin_target = case_when(LightModerateMinutes < 150 ~ "no", LightModerateMinutes >=150 ~ "yes"))

#View(wkTotals.activity)

# determine % of weeks each user is consistent
wkTargets.activity <- wkTotals.activity %>% group_by(Id2) %>% 
  summarize(LM_target = sum(LightModMin_target == "yes"),
            V_target = sum(VeryMin_target == "yes")) 

#View(wkTargets.activity)

# group by light-moderate and very active targets, calculate count and percents, then bind separate data sets together
numUsers.LM <- wkTargets.activity %>% group_by(LM_target) %>% 
  summarize(count_users = n()) %>% mutate(pcts_users = round((count_users / 32), digits=2), activity = "light or moderate")

numUsers.LM <- rename(numUsers.LM, wk_count = "LM_target")
#View(numUsers.LM)

numUsers.V <- wkTargets.activity %>% group_by(V_target) %>% 
  summarize(count_users = n()) %>% mutate(pcts_users = round((count_users / 32), digits=2), activity = "very")


numUsers.V <- rename(numUsers.V, wk_count = "V_target")

numUsers.wkActivity <- bind_rows(numUsers.LM, numUsers.V)
#View(numUsers.wkActivity)
```
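
To illustrate the counting logic in the chunk above, here is a small sketch on made-up weekly totals. The data frame and column names (`toy_weeks`, `very_min`, `weeks_met`) are hypothetical, and the denominator is taken from the data itself rather than typed in, which is one possible alternative to a fixed user count.

```r
library(dplyr)

# Toy weekly totals (hypothetical values): 3 users x 3 weeks of very active minutes
toy_weeks <- data.frame(
  user     = rep(c("A", "B", "C"), each = 3),
  very_min = c(80, 75, 90,   10, 0, 76,   0, 0, 0)
)

# Number of weeks in which each user met the 75+ minute target (75 itself counts)
toy_targets <- toy_weeks %>%
  group_by(user) %>%
  summarize(weeks_met = sum(very_min >= 75), .groups = "drop")

# Share of users by number of weeks met; the denominator is the observed user count
toy_share <- toy_targets %>%
  count(weeks_met, name = "count_users") %>%
  mutate(pct_users = round(count_users / sum(count_users), digits = 2))

toy_share
```

Deriving the denominator from the data keeps the percentages correct if the set of users changes after filtering.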



#### Share -- Weekly Activity Target Times
In the bar graph, over three quarters of users achieved the weekly goal of 150+ minutes of light-to-moderate exercise in all three weeks. Around 30% of users logged 75+ minutes of very active exercise per week in all three weeks, while about 30% of users did not reach that goal in any week.

Based on this sample, most users meet the weekly recommendation for light-to-moderate exercise, while very active exercise is more variable among users.
```{r plot weekly activity}
# visualize percentage results in bar graph, facet wrap by activity type
bar.wkActivity <- numUsers.wkActivity %>% 
  ggplot(aes(x = wk_count, y = pcts_users)) +
  geom_bar(aes(fill=activity), stat="identity") + 
  geom_text(aes(label = count_users), vjust=1.3) +
  ggtitle("In three weeks: most users met low-moderate activity goals, \n about a third met very active goals") +
  scale_y_continuous(labels=scales::label_percent(accuracy=1)) +
  scale_fill_manual(values = c("palegreen3", "sienna2"), labels= c("low or moderate (>=150 min)", "very (>=75 min)")) +
  labs(subtitle = "Fitbit minute intensity data for three weeks (Sun, April 17 to Sat, May 07, 2016)", x = "weeks", y = "users (32 total)", caption = "Labels on bars are number of users \nSource: Furberg et al. 2016. Zenodo.", fill = "weekly activity goals") +
  theme_minimal() +
  theme(plot.caption.position="plot",
        plot.title= element_text(face="bold")) +
  facet_wrap(~activity) +
  coord_cartesian(ylim = c(0, 1))
  
bar.wkActivity

# ggsave("./plots/bar.wkActivity.png" , width = 8, height = 5, dpi=300)

```

### Analyze -- Intensity Minute Trends 

**3. Minutes of Different Intensity Exercise and Calories**

Building on step 2, I grouped users by whether they met each weekly target in a majority of weeks (at least 2 out of 3 weeks): light-moderate exercise, very active exercise, or both. For instance, LMV means that a user met the light-moderate target in at least 2 weeks and the very active target in at least 2 weeks.

For adding regression lines to the scatter plots, I referred to "Basic R Guide for NSC Statistics" by Deanna Li, [Chapter 19: Scatterplots and Best Fit Lines - Two Sets](https://bookdown.org/dli/rguide/scatterplots-and-best-fit-lines-two-sets.html).

```{r group users by intensity}
#View(wkTargets.activity)

userTypes.intensity <- wkTargets.activity %>% 
  mutate(user_type = case_when(LM_target >= 2 & V_target >= 2 ~ "LMV",
                               LM_target < 2 & V_target < 2 ~ "not active",
                               LM_target >= 2 & V_target < 2 ~ "LM",
                               LM_target < 2 & V_target >= 2 ~ "V"))

userTypes <- userTypes.intensity %>% select(Id2, user_type)

userTypes %>% group_by(user_type) %>% summarize(count = n()) %>% glimpse()
#View(userTypes)


# left join wkTotals.activity and userTypes
wkTotals.activity <- merge(x = wkTotals.activity, y = userTypes, by="Id2", all.x=TRUE)

wkTotals.activity <- wkTotals.activity %>% mutate(TotalActiveMinutes = VeryActiveMinutes + LightModerateMinutes)

#View(wkTotals.activity)

```
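
The grouping rule can be checked on a few hand-made cases. The sketch below uses hypothetical users and column names (`toy_counts`, `LM_weeks`, `V_weeks`) and an ordered `case_when()` with a final catch-all, which is one way to make sure that every user lands in exactly one group.

```r
library(dplyr)

# Hypothetical counts of weeks (out of 3) in which each target was met
toy_counts <- data.frame(
  user     = c("A", "B", "C", "D"),
  LM_weeks = c(3, 2, 0, 1),
  V_weeks  = c(3, 0, 0, 2)
)

# Conditions are evaluated top to bottom; the first match wins
toy_types <- toy_counts %>%
  mutate(user_type = case_when(
    LM_weeks >= 2 & V_weeks >= 2 ~ "LMV",
    LM_weeks >= 2                ~ "LM",
    V_weeks >= 2                 ~ "V",
    TRUE                         ~ "not active"
  ))

toy_types
```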



#### Share -- Weekly Intensity Minute Trends
```{r plot weekly intensity minutes}
# overview of users and intensity minutes and calories
scatter.wkMinutes <- wkTotals.activity %>% 
  ggplot(aes(x=LightModerateMinutes, y = VeryActiveMinutes, color = user_type)) +
  geom_point(aes(size = wkCalories, alpha =wkCalories)) +
  geom_hline(yintercept = 75, linetype="dashed", color="red") +
  geom_vline(xintercept = 150, linetype = "dashed", color = "red") +
  ggtitle("Weekly very active vs. light-or-moderate exercise, and calories burned") +
  scale_color_manual(values = c("palegreen3", "sienna2", "lightgoldenrod2"), labels= c("low or moderate", "low, moderate, or very", "not active")) +
  scale_x_continuous(breaks=seq(0, 2700, 400)) +
  labs(subtitle = "Fitbit minute intensity data for three weeks (Sun, April 17 to Sat, May 07, 2016)", x = "light-moderate active minutes", y= "very active minutes", size = "calories", alpha = "calories", color = "type of exerciser", caption = "Dotted lines are weekly activity goals \nSource: Furberg et al. 2016. Zenodo.") +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title= element_text(face="bold"))
scatter.wkMinutes

# ggsave("./plots/scatter.wkMinutes.png" , width = 8, height = 5, dpi=300)

# plotting weekly distance and minutes by user type, filter out "not active" users
scatter.wkDistanceMinutes <- wkTotals.activity %>% filter(user_type != "not active") %>% 
  ggplot(aes(x=TotalActiveMinutes, y = wkDistance)) +
  geom_point(aes(color = user_type, alpha = wkCalories, size = wkCalories)) +
  geom_smooth(method="lm", se = FALSE, aes(color = user_type), alpha = 0.2) +
  scale_color_manual(values = c("palegreen3", "sienna2"), labels= c("low or moderate", "low, moderate, or very")) +
  ggtitle("More Calories Burned with More Weekly Exercise") +
  labs(subtitle="Very Active Exercisers Achieve Farther Distances in Less Time", x = "total active minutes", y = "total week distance, km", size = "week calories", alpha = "week calories", color = "type of exerciser", caption = "Fitbit data for three weeks (Sun, April 17 to Sat, May 07, 2016) \nSource: Furberg et al. 2016. Zenodo.") +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title= element_text(face="bold"))
scatter.wkDistanceMinutes

# ggsave("./plots/scatter.wkDistanceMinutes.png" , width = 8, height = 5, dpi=300)


```

The first scatter plot shows each weekly observation by its light-moderate active minutes and very active minutes, with the weekly calories represented by the size of the data point. Colors represent the type of exercise for which each user frequently meets the weekly target minutes.

In the second scatter plot, very active users (LMV) log more total active minutes and more distance per week than LM users, who do not frequently meet the very active target. Weekly active minutes are positively correlated with weekly distance, and both are associated with more weekly calories burned.
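
A visual trend like this can be backed with a number. As a hedged sketch on made-up weekly values (not the Fitbit results), the correlation coefficient and the slope of the fitted line that `geom_smooth(method = "lm")` draws could be computed like this; `toy_wk` and its columns are hypothetical.

```r
# Toy weekly observations (hypothetical values)
toy_wk <- data.frame(
  active_min  = c(900, 1200, 1500, 1800, 2100),
  distance_km = c(18, 27, 31, 40, 46)
)

# Pearson correlation between weekly active minutes and weekly distance
toy_r <- cor(toy_wk$active_min, toy_wk$distance_km)

# Simple linear regression; the slope is the extra distance per active minute
toy_fit   <- lm(distance_km ~ active_min, data = toy_wk)
toy_slope <- coef(toy_fit)[["active_min"]]

c(r = toy_r, slope = toy_slope)
```

Running the same two calls per user type on the real weekly totals would show how strong each group's relationship is, rather than relying on the plot alone.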

#### Key Findings -- Daily Activity

* It would be helpful for Bellabeat users to have a graph summary of their minutes of light/moderate exercise and very active exercise, including a daily breakdown and weekly totals. Showing the weekly target minutes for both intensities next to users' weekly totals, with a reminder, could encourage people to improve or maintain their exercise activity on a daily and weekly basis.

* Based on the Fitbit data, fewer users meet the very active weekly goal, whereas meeting the light-to-moderate weekly target is more common. However, users who spend more time exercising very actively appear to burn more calories in less time and over a shorter distance than users who consistently perform only light-to-moderate activity. This suggests that more intense workouts can be a time saver with added health benefit. Providing this information to users could help them set overall exercise goals.


### Hourly Activity Data
#### Prepare -- Merge hourly activity datasets

**1. Merging and filtering steps, calories, intensity [sci] data**

Looking at the other data sets beyond daily activity, I noticed I could combine the hourly calories, hourly steps, and hourly intensity data into one data frame. I then filtered the data to the hours when people are regularly moving and exercising, between 6 AM and 9 PM, and again to the three-week period from Sunday, April 17, 2016 to Saturday, May 07, 2016.

```{r hours of exercising}
#View(h.calories)
#View(h.steps)
#View(h.intensities)

fn.countId <- function(dt) {
  dt <- dt %>% count(Id)
  return(glimpse(dt))
}

fn.countId(h.calories) # 33 users (fn.countId already calls glimpse())

#View(sc)
# steps + calories
sc <- merge(h.steps, h.calories, by=c("Id","ActivityHour"))
# steps + calories + intensities
sci <- merge(sc, h.intensities, by=c("Id","ActivityHour"))
sci <- merge(sci, Id.new, by="Id", all.x=TRUE) # add Id2
#View(sci)


# set timezone (the environment variable name is upper-case TZ)
Sys.setenv(TZ="America/New_York")
Sys.timezone()


# convert ActivityHour to datetime format
sci$datetime <-strptime(sci$ActivityHour, "%m/%d/%Y %I:%M:%S %p")

# extract date, hour of day, and hour:minute labels from datetime
# (datetime is already a date-time object, so no input format is needed)
fn.extractHr <- function(dt){
  dt$date <- as.Date(dt$datetime)
  dt$time <- as.POSIXct(dt$datetime)
  dt$hour <- format(dt$time, format="%I %p")
  dt$hour_min <- format(dt$time, format="%I:%M %p")
  return(dt)
}


sci <- fn.extractHr(sci)
#View(sci)
sci$Day <- weekdays(sci$date)

# range of hours when most people are moving (not sleeping)
amHrs <- paste(6:9, "AM", sep=" ")
amHrs <- paste0(0, amHrs)
midHrs <- paste(10:11, "AM", sep=" ")
pmHrs <- paste(1:9, "PM", sep=" ")
pmHrs <- paste0(0, pmHrs)
movingHrs <- c(amHrs, midHrs, "12 PM", pmHrs)
#glimpse(movingHrs)

# filter sci data for select hours and filter for dates between 
sci_fil <- sci %>% filter(hour %in% movingHrs) %>% filter(between(date, as.Date("2016-04-17"), as.Date("2016-05-07")))
#View(sci_fil)


# examine Id2 in sci_fil - Does it match users in userTypes? YES
Id.sci_fil <- sci_fil %>% group_by(Id2) %>% summarize(count = n()) # 32 users
#View(Id.sci_fil)
#View(userTypes)

testIdmatch <- merge(x=Id.sci_fil, y = userTypes, by="Id2", all.x = TRUE )
#View(testIdmatch)

# merge sci with userTypes
sci_fil <- merge(x=sci_fil, y=userTypes, by="Id2", all.x = TRUE )
```
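
The hour filter above matches on text labels such as "06 AM". As an alternative sketch, the same window can be expressed with numeric 24-hour values, which avoids building the label vector by hand. The timestamps and names below (`toy_raw`, `toy_hour24`, `toy_keep`) are hypothetical, and parsing `%p` assumes a locale that uses AM/PM.

```r
# Toy timestamps in the same text layout as ActivityHour (hypothetical values)
toy_raw <- c("4/17/2016 6:00:00 AM", "4/17/2016 1:00:00 PM", "4/17/2016 11:00:00 PM")

# Parse to date-time; tz is fixed here only to keep the example reproducible
toy_dt <- as.POSIXct(toy_raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Hour of day as an integer from 0 to 23
toy_hour24 <- as.integer(format(toy_dt, "%H"))

# Keep the hours from 6 AM through the 9 PM hour (6 to 21)
toy_keep <- toy_hour24 >= 6 & toy_hour24 <= 21

data.frame(raw = toy_raw, hour24 = toy_hour24, keep = toy_keep)
```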

#### Analyze -- Group data by hour and day

**2. What hours of each day are people exercising?**

After prepping the combined hourly steps, calories, and intensity data [sci_fil] for analysis, I grouped it by hour of day and day of week and calculated the mean of each measurement to see when users were exercising the most.


```{r analyze sci}
#View(sci_fil)

fn.hourDay.avgs <- function(dt) {
  dt <- dt %>% group_by(hour, Day) %>% 
    summarize(avgStepTotal = mean(StepTotal), avgCalories = mean(Calories),
            avgTotalIntensity = mean(TotalIntensity)) %>%
  mutate(across(where(is.numeric), round))
  return(dt)
}
  
sciAvgs.DayHour <- fn.hourDay.avgs(sci_fil)
#View(sciAvgs.DayHour)


```
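
To show what the hour-by-day averaging produces, here is a base R sketch on a few made-up rows. It uses `aggregate()` instead of the dplyr function above, and `toy_hours` and `toy_avgs` are hypothetical names.

```r
# Toy hourly rows (hypothetical values): two Mondays and two Tuesdays at 5 PM
toy_hours <- data.frame(
  Day            = c("Monday", "Monday", "Tuesday", "Tuesday"),
  hour           = c("05 PM", "05 PM", "05 PM", "05 PM"),
  TotalIntensity = c(10, 20, 30, 50)
)

# One row per hour/day combination, holding the mean intensity for that cell
toy_avgs <- aggregate(TotalIntensity ~ hour + Day, data = toy_hours, FUN = mean)

toy_avgs
```

Each row of the result corresponds to one tile in a heatmap of hour against day.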



#### Share -- Intensity heatmaps by hour and day of week

I plotted the hourly activity data [sci] using different heatmaps. When making the heatmaps, I referenced Yan Holtz's [heatmap tutorial in the R Graph Gallery](https://www.r-graph-gallery.com/79-levelplot-with-ggplot2.html), as well as the [post on RColorBrewer's palettes](https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html), which was helpful in selecting the colors for the heatmaps.

```{r plot sci analysis, fig.width=8.5}

# Overall average total intensities by hour and day of the week
heat.DayHour.intensity <- sciAvgs.DayHour %>% ggplot(aes(x = hour, y=Day, fill=avgTotalIntensity)) +
  geom_tile() +
  ggtitle("On Average, More Intense Activity on Wednesdays after 5 p.m., Saturday 1 p.m., Tuesdays at Noon") +
  labs(fill = "average intensity", subtitle="More intense workouts after 5 p.m. in earlier weekdays of the week (Mon.- Wed.)", caption = "Fitbit data for three weeks (Sun, April 17 to Sat, May 07, 2016) \nSource: Furberg et al. 2016. Zenodo.", y="day") +
  scale_x_discrete(limits= movingHrs) +
  scale_y_discrete(limits= rev(li.daysofwk)) +
  scale_fill_distiller(palette="Oranges", direction =1) +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title.position = "plot",
        plot.title= element_text(face="bold"))

heat.DayHour.intensity

# ggsave("./plots/heat.DayHour.intensity.png" , width = 10.5 , height = 7, dpi=300)

# Overall average total steps by hour and day of the week
heat.DayHour.steps <- sciAvgs.DayHour %>% ggplot(aes(x = hour, y=Day, fill=avgStepTotal)) +
  geom_tile() +
  ggtitle("On Average, More Steps on Wednesdays 5-6 p.m., Saturday 1-2 p.m., and Tuesdays at Noon") +
  labs(fill = "average steps", subtitle="More intense workouts after 5 p.m. on earlier weekdays of the week (Mon.-Wed.)", caption = "Fitbit data for three weeks (Sun, April 17 to Sat, May 07, 2016) \nSource: Furberg et al. 2016. Zenodo.", y="day") +
  scale_x_discrete(limits= movingHrs) +
  scale_y_discrete(limits= rev(li.daysofwk)) +
  scale_fill_distiller(palette="GnBu", direction =1) +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title.position="plot",
        plot.title= element_text(face="bold"))
heat.DayHour.steps

# ggsave("./plots/heat.DayHour.steps.png" , width = 10.5 , height = 7, dpi=300)

# separate data by userType 
avgsByHours_LM <- sci_fil %>% filter(user_type == "LM") %>% fn.hourDay.avgs()
#View(avgsByHours_LM)

avgsByHours_LMV <- sci_fil %>% filter(user_type == "LMV") %>% fn.hourDay.avgs()
#View(avgsByHours_LMV)


# for frequent light-moderate exercisers only - average total intensities by hour and day of week
heat.LM.intensity <- avgsByHours_LM %>% 
  ggplot(aes(x = hour, y = Day, fill=avgTotalIntensity)) +
  geom_tile() +
  ggtitle("On Average for Light/Moderate Exercisers: More Intense Activity on Weekends, and \nWeekdays After 6 p.m.") +
  labs(fill = "average intensity", subtitle="Monday, Tuesday, and Wednesday show more intense activity, of the weekdays", caption = "Fitbit data for three weeks (Sun, April 17 to Sat, May 07, 2016) \nSource: Furberg et al. 2016. Zenodo.", y="day") +
  scale_y_discrete(limits= rev(li.daysofwk)) +
  scale_x_discrete(limits=movingHrs) +
  scale_fill_distiller(palette="YlGn", direction =1) +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title.position = "plot",
        plot.title= element_text(face="bold"))
heat.LM.intensity

# ggsave("./plots/heat.LM.intensity.png", width = 10.5 , height = 7, dpi=300)

# frequent light-moderate-very active exercisers - average total intensities by hour and day of week
heat.LMV.intensity <- avgsByHours_LMV %>% 
  ggplot(aes(x = hour, y = Day, fill=avgTotalIntensity)) +
  geom_tile() +
  ggtitle("On Average for Very Active Exercisers: More Activity After 5 p.m., Monday through Wednesday") +
  labs(fill = "average intensity", subtitle="Monday, Tuesday, and Saturday hours around noon and Sunday nights also have more intense activity", caption = "Fitbit data for three weeks (Sun, April 17 to Sat, May 07, 2016) \nSource: Furberg et al. 2016. Zenodo.", y="day") +
  scale_y_discrete(limits= rev(li.daysofwk)) +
  scale_x_discrete(limits=movingHrs) +
  scale_fill_distiller(palette="OrRd", direction =1) +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title.position = "plot",
        plot.title = element_text(face="bold"))
heat.LMV.intensity

# ggsave("./plots/heat.LMV.intensity.png" , width = 10.5 , height = 7, dpi=300)


```

#### Key Findings -- Hourly Activity

Based on the Fitbit hourly exercise heatmaps, I would suggest the following to improve Bellabeat Time:

* Provide a questionnaire in the tracking app that asks users what exercise intensity they intend to do on a weekly basis (very, moderate, or light activity), and suggest optimal exercise hours based on the hourly trends of Fitbit users, grouped by exercise intensity type.

* Weekdays after 5 p.m. and weekend afternoons are, on average, the more popular times for intense activity. Giving users the option to create an hourly exercise schedule, including reminders to exercise at certain times of the day, could improve the consistency of workouts.



## Sleep Tracking
#### Prepare -- Clean sleep tracking data

**1. Cleaning the sleep data**

The original sleepDay data set only has data for 24 users. This small sample size (fewer than 30 users) is a limitation of the data analysis.

First, I wanted to see how consistently each user logged their sleep minutes over a period of time. Again, I performed the usual preparation steps, assigning Id2 to each Id for easier readability and joining the daily activity data with the sleep data.

When cleaning the data, I found three users with duplicate data for the same dates. I removed these duplicate rows before plotting the data.


```{r sleep-analysis}

# merge sleepDay and d.activity
#View(sleepDay)
#View(d.activity)
sleepDay$date <- as.Date(sleepDay$SleepDay, format="%m/%d/%Y")
sleepDay <- merge(x=sleepDay, y=Id.new, by="Id", all.x = TRUE )
#glimpse(sleepDay) # 413 rows

sleepDay_sel <- sleepDay %>% select(TotalSleepRecords: Id2) 

sleepId <- sleepDay_sel %>% group_by(Id2) %>% summarize(n()) %>% glimpse()
#View(sleepId)

sleepDates <- sleepDay_sel %>% group_by(date) %>% summarize(n()) %>% glimpse()
#View(sleepDates)

# check sleep data for count by date and user Id2
sleepTest <- sleepDay_sel  %>% group_by(date, Id2) %>% summarize(count = n()) %>% arrange(desc(count))
#View(sleepTest)

## Id2: P, S, ZD each have 2 entries for same date (duplicate data)
sleepTest2 <- sleepDay_sel %>% filter( Id2 == "ZD" & date == as.Date("2016-04-25") | Id2 == "P" & date == as.Date("2016-05-05") |
                                          Id2 == "S" & date== as.Date("2016-05-07"))
# View(sleepTest2) 

# remove duplicated rows in sleep data
sleepDay_sel <- sleepDay_sel %>% distinct() %>% glimpse() # 410 rows

# merge sleep and activity data
sleepActivity <- merge(x=sleepDay_sel, y=d.activity, by=c("Id2", "date"), all.x=TRUE)

# add columns for total minutes of exercise
sleepActivity <- sleepActivity %>% mutate(LightModMinutes = (LightlyActiveMinutes + FairlyActiveMinutes)) %>% 
  mutate(TotalActiveMinutes = (VeryActiveMinutes + LightModMinutes))

#View(sleepActivity)
# add exercise intensity type of each user
sleepActivity.userTypes <- merge(x = sleepActivity, y = userTypes, by = "Id2", all.x=TRUE)

#View(sleepActivity.userTypes)

```
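
The duplicate check and removal can be illustrated on a tiny made-up table. This base R sketch uses `duplicated()` and `unique()` in place of the dplyr calls above; `toy_sleep` and its values are hypothetical.

```r
# Toy sleep records (hypothetical values); the first two rows are exact duplicates
toy_sleep <- data.frame(
  Id2     = c("P", "P", "S"),
  date    = as.Date(c("2016-05-05", "2016-05-05", "2016-05-07")),
  minutes = c(400, 400, 380)
)

# Rows that repeat an earlier row in every column
toy_dups <- toy_sleep[duplicated(toy_sleep), ]

# Keep only the first copy of each distinct row
toy_clean <- unique(toy_sleep)

nrow(toy_sleep) - nrow(toy_clean)  # number of rows removed
```

Note that this only removes rows that match in every column. Two records for the same user and date with different sleep minutes would both be kept and would need a separate decision.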


### Daily Sleep Records
#### Analyze -- Consistency of sleep records per user

Again, I filtered the daily sleep data to three weeks. I wanted to see how consistently users recorded their sleep time, so I counted the total number of date entries for each user.

```{r analyze sleep activity}
# filter data for three week period
sleepActivity.userTypes_fil <- sleepActivity.userTypes %>% filter(between(date, as.Date("2016-04-17"), as.Date("2016-05-07")))

#View(sleepActivity.userTypes_fil)
# How consistent is the data?
## How many days each user logged sleep data?
sleepUsers <- sleepActivity.userTypes_fil %>% group_by(Id2) %>% summarize(num_days=n()) %>% glimpse() # 23 users
#View(sleepUsers)

## count how many users logged how many days
sleep.dateCount <- sleepUsers %>% ungroup() %>% group_by(num_days) %>% summarize(user_count=n())

userCount <- sleep.dateCount %>% summarize(totalUsers = sum(user_count))
#glimpse(userCount)

```


#### Share -- Visualize user percentage by number of days

The bar graph shows the proportion of the 23 users by the number of days they logged sleep data during the selected three weeks. Only 4 users (around 17%) logged sleep minutes for all 21 days.
```{r plot sleepanalysis, fig.width=9, fig.height=7}
# plot to show percentage of users and record of sleep data over time

bar.sleep.dates <- sleep.dateCount %>% ggplot(aes(x=num_days, y = user_count/ sum(user_count))) + 
  geom_bar(stat="identity", fill = "slategray2") +
  geom_text(aes(label = user_count), vjust=-.25) +
  ggtitle("Fewer than 20% of users recorded sleep data daily for 3 weeks") +
  labs( x = "days", y = "users (23 total)", subtitle="Fitbit sleep tracking data for April 17 through May 7, 2016", caption="Bar labels are number of users \nSource: Furberg et al. 2016. Zenodo.") +
  scale_x_continuous(breaks=seq(1, 21, 2)) +
  scale_y_continuous(labels=scales::label_percent(accuracy=1)) +
  coord_cartesian(ylim=c(0, 1)) +
  theme_minimal() +
  theme(plot.title = element_text(face="bold"))

bar.sleep.dates

# ggsave("./plots/bar.sleep.dates.png" , width = 9 , height = 7, dpi=300)


```



Bar graph references:

* Thomas Neitmann's [blogpost about percentage scales](https://thomasadventure.blog/posts/ggplot2-percentage-scale/) was a helpful explainer on how to use the scales package. Note that percent() and percent_format() are retired, according to the [scales package documentation](https://cran.r-project.org/web/packages/scales/scales.pdf), so I used label_percent() instead.

* To adjust the y-axis scale limits to show 0 to 100%, I used coord_cartesian(). To read more about this, see roelpi's [blog post](https://www.roelpeters.be/add-percentages-to-your-axes-in-rs-ggplot2-and-set-the-limits/) about adding percentage limits using ggplot2.

* I referred to a [Stack Overflow discussion](https://stackoverflow.com/questions/12018499/how-to-put-labels-over-geom-bar-for-each-bar-in-r-with-ggplot2) for adding labels to bars in ggplot2, namely the response from user rcs to adjust the placement using vjust.

#### Key Findings -- Sleep Records

The lack of consistent daily sleep logging among Fitbit users shows a problematic trend. Perhaps users are not wearing their tracking watches while they sleep.

* I suggest that the Bellabeat team create a wearable sleep tracking ring, which some companies like [Oura](https://ouraring.com) have designed. This could act as a companion to the Time watch, and the smaller device might encourage more people to wear a tracker when sleeping.
* Providing daily reminders to wear the watch while sleeping, through the Bellabeat app, could help promote consistency of sleep tracking data collection.

### Healthy Daily Sleep
#### Analyze -- Target sleep day percentages VS. user percentages
**2. How often do Fitbit users get enough daily sleep?**

According to a 2015 study, [Watson et al](http://dx.doi.org/10.5665/sleep.4716), cited by the [CDC](https://www.cdc.gov/sleep/about_sleep/how_much_sleep.html), the healthy amount of sleep for an adult is 7 hours or more a night. Since the sleep data is in minutes, this converts to 420+ minutes a night.

I categorized each day as >= 420 min. or < 420 min. to determine, for each user, the percentage of recorded sleep days that met the target. By grouping the data further, I found the count and percentage of users at each percentage of healthy sleep days.

Next, I grouped the percentages of healthy sleep days into four ranges (quarters) and recounted the total users and the respective user percentages per range. This data is easier to visualize than the individual, non-grouped healthy sleep day percentages.

```{r sleep and exercise}

# For each user, how often did they get a healthy amount of sleep? 
#View(sleepActivity.userTypes_fil)

healthy.sleepDays <- sleepActivity.userTypes_fil %>%
  mutate(sleepTarget = case_when(TotalMinutesAsleep >= 420 ~ "healthy",
                                 TotalMinutesAsleep < 420 ~ "no"))

## percentages by number of days they recorded sleep data
healthy.sleepDays.byUser <- healthy.sleepDays %>% group_by(Id2) %>% 
  summarize(healthy_sleep_days = sum(sleepTarget == "healthy"),
            total_days = n()) %>%
  mutate(pct_healthy_sleep = round(healthy_sleep_days / total_days, digits=2))

#View(healthy.sleepDays.byUser)

## group by percentages of healthy sleep days to find user counts and percentages
healthy.sleepDays.pctUsers <- healthy.sleepDays.byUser %>% group_by(pct_healthy_sleep) %>% 
  summarize(count_user = n()) %>% 
  mutate(pct_user = round( count_user / sum(count_user), digits=2)) %>% ungroup()
glimpse(healthy.sleepDays.pctUsers)

sum(healthy.sleepDays.pctUsers$count_user) #23 total users


ranges.healthySleepDays.pctUsers <- healthy.sleepDays.pctUsers %>% 
  mutate( ranges_sleepDays = dplyr::case_when( pct_healthy_sleep >= 0.75 ~ "75-100%",
                    pct_healthy_sleep >= 0.50 ~ "50-74%",
                    pct_healthy_sleep >= 0.25 ~ "25-49%",
                    TRUE ~ "0-24%"))  # everything below 25%, including users with 0% healthy sleep days

ranges.healthySleepDays.pctUsers <- ranges.healthySleepDays.pctUsers %>% 
  select(!pct_user) %>% 
  group_by(ranges_sleepDays) %>% 
  summarize(count_user = sum(count_user)) %>% ungroup() %>% 
  mutate(pct_user = round( count_user / sum(count_user), digits=2)) %>% ungroup()
  
#View(ranges.healthySleepDays.pctUsers)
            
```
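
The quarter ranges can also be built with base R's `cut()`. The sketch below is an alternative under stated assumptions, not the method used above: the percentages are made up, and `toy_pct` and `toy_bins` are hypothetical names.

```r
# Toy shares of healthy sleep days per user (hypothetical values between 0 and 1)
toy_pct <- c(0, 0.10, 0.25, 0.49, 0.50, 0.74, 0.75, 1)

# right = FALSE makes each bin include its lower bound, e.g. [0.25, 0.50);
# include.lowest = TRUE then also keeps the top value (1) in the last bin
toy_bins <- cut(
  toy_pct,
  breaks = c(0, 0.25, 0.50, 0.75, 1),
  labels = c("0-24%", "25-49%", "50-74%", "75-100%"),
  right = FALSE,
  include.lowest = TRUE
)

table(toy_bins)
```

Because `cut()` returns an ordered set of factor levels, the ranges keep their intended order on a plot axis even if one range has no users.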


#### Share -- Visualize target sleep day percentages by users

In the bar chart, fewer than 25% of Fitbit users (5 users) had a healthy sleep duration (at least 7 hours) on at least 75% of their recorded days, while just over 25% of users (6 users) had enough sleep on fewer than a quarter of their recorded days. This suggests that users could use more consistency in the amount of sleep they are getting.

```{r percent sleep days}

bar.pctHealthySleepDays <- ranges.healthySleepDays.pctUsers %>% 
  ggplot(aes(x = ranges_sleepDays, y = pct_user)) +
  geom_bar(stat="identity", fill="lightskyblue") +
  geom_text(aes(label = count_user), vjust = 1.4) +
  ggtitle("Fewer than 1/4 Fitbit users had at least \n75% of healthy sleep days (7+ hours/day)") +
  labs(x = "days of healthy sleep", y="users (23 total)", subtitle="Over 1/4 users had fewer than 1/4 total healthy sleep days (April 17 to May 07, 2016)", caption = "Bar labels are number of users \nSource: Furberg et al. 2016. Zenodo.") +
  scale_y_continuous(labels=scales::label_percent(accuracy=1)) +
  scale_x_discrete() +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title= element_text(face="bold")) +
  coord_cartesian(ylim = c(0, 1))
bar.pctHealthySleepDays 

# ggsave("./plots/bar.pctHealthySleepDays.png" , width = 10, height = 7, dpi=300)



```


#### Key Findings -- Suggestions for More Healthy Sleep Days

Nearly half of the Fitbit users had a healthy amount of sleep on fewer than 50% of their recorded sleep days, so another sleep data trend is that many users are not regularly sleeping the recommended 7+ hours per day.

I suggest that Bellabeat create a function in the app for a user to select target sleep and waking times that reflect a healthy amount of daily sleep. The addition of notification reminders for users to sleep at a certain time could help increase the number of users who are getting enough daily sleep.


### Sleep VS. Exercise Activity
#### Analyze
**If a person, on average, sleeps enough in a week, does that correlate with them spending more time exercising during that same week?**

I wanted to examine:

* Are sleep times and activity times correlated?
* Are sleep times and calories correlated?

Taking the merged sleep and daily activity dataset, I grouped the data by user and week. For each week, I calculated the mean and median daily sleep minutes, the total weekly activity minutes, steps, and calories, and the mean daily calories.

```{r sleep x activity}
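# Is there a correlation of healthy daily sleep and exercise activity?
# note: sleep was left-joined with activity, so weekly sums only include days that have a sleep record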
wk.sleepActivity <- sleepActivity.userTypes_fil %>% ungroup() %>% group_by(Id2, week_date) %>% 
  summarize(avgSleepMin = mean(TotalMinutesAsleep),
            medianSleepMin = median(TotalMinutesAsleep),
            wkActiveMin = sum(TotalActiveMinutes),
            wkLightModMin = sum(LightModMinutes),
            wkVeryActiveMin = sum(VeryActiveMinutes),
            totalSteps = sum(TotalSteps),
            totalCalories = sum(Calories),
            avgCalories = mean(Calories))

wk.sleepActivity <- merge(x = wk.sleepActivity, y = userTypes, by="Id2", all.x= TRUE)

# View(wk.sleepActivity)

```



#### Share

For the first scatter plot showing average sleep time and total weekly activity time, I colored the points by type of exerciser, using the user intensity types from the earlier activity analysis. The size of the points reflects the total weekly calories. Both groups of exercisers show a slightly positive correlation of total weekly exercise time and weekly average sleep time.

In the second scatter plot, total weekly calories is also slightly positively correlated with weekly average sleep time, though there are outlier cases.

```{r plot sleep-exercise}
# weekly active minutes vs. average sleep minutes
scatter.wk.sleepActiveMin <- wk.sleepActivity %>% 
  ggplot(aes(x = avgSleepMin, y = wkActiveMin)) +
  geom_point(aes(color = user_type, size=totalCalories)) +
  geom_smooth(method="lm", se = FALSE, alpha = 0.2, aes(color = user_type)) +
  geom_vline(xintercept = 420, linetype = "dashed", color = "red") +
  ggtitle("Weekly Average Sleep and Activity Times Are Positively Correlated") +
  scale_y_continuous(breaks = seq(0, 2800, 400)) +  # y-axis is weekly active minutes, so use the minute-scale breaks
  scale_color_manual(values = c("palegreen3", "sienna2", "lightgoldenrod2"), labels= c("low or moderate", "low, moderate, or very", "not active")) +
  labs(subtitle = "Weekly Fitbit data for April 17 to May 7, 2016", x = "average sleep minutes", y = "total weekly active minutes", size = "total week calories", color = "type of exerciser", caption = "Dotted line is healthy sleep per day \nSource: Furberg et al. 2016. Zenodo.") +
  theme_minimal() +
  theme(plot.caption.position = "plot",
        plot.title= element_text(face="bold"))
scatter.wk.sleepActiveMin

# ggsave("./plots/scatter.wk.sleepActiveMin.png" , width = 10, height = 7, dpi=300)

# total weekly calories vs. average sleep minutes
scatter.wk.sleepCalories <- wk.sleepActivity %>% 
  ggplot(aes(x = avgSleepMin, y = totalCalories)) +
  geom_point() +
  geom_smooth(method="lm", se = FALSE, color = "slategray2") +
  geom_vline(xintercept = 420, linetype = "dashed", color = "red") +
  ggtitle("Weekly Average Sleep Time and Total Calories Are Positively Correlated") +
  scale_y_continuous(breaks = seq(0, 28000, 4000)) +
  labs(subtitle = "Weekly Fitbit data for April 17 to May 7, 2016", x = "average sleep minutes", y = "total weekly calories", caption = "Dotted line is healthy sleep per day \nSource: Furberg et al. 2016. Zenodo.") +
  theme_minimal() +
  theme(plot.caption.position="plot",
        plot.title= element_text(face="bold"))
scatter.wk.sleepCalories

# ggsave("./plots/scatter.wk.sleepCalories.png" , width = 10, height = 7, dpi=300)

```

#### Key Findings -- Sleep and exercise activity

Average sleep time is positively correlated with both exercise time and calories burned, by week. This is a correlation, not proof of cause, but if Bellabeat Time users are given daily reminders to meet sleep targets (7+ hours / day), that could, in turn, support their exercise activity.
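
To put a number on a relationship like this one, Pearson's correlation coefficient is a common choice. The chunk below is a minimal sketch on a small made-up data frame (`demo.sleepActivity` and its values are hypothetical, not the Fitbit data); the column names mirror those used in the scatter plots above.

```{r sketch-sleep-activity-correlation}
# Hypothetical toy data: one row per user-week (values made up for illustration)
demo.sleepActivity <- data.frame(
  avgSleepMin = c(360, 390, 420, 435, 450, 480, 500),
  wkActiveMin = c(1400, 1650, 1500, 1900, 1800, 2100, 2050)
)

# Pearson correlation between average sleep minutes and weekly active minutes
demo.r <- cor(demo.sleepActivity$avgSleepMin, demo.sleepActivity$wkActiveMin)

# cor.test() adds a confidence interval and p-value,
# which matters when the sample is as small as this one
demo.test <- cor.test(demo.sleepActivity$avgSleepMin, demo.sleepActivity$wkActiveMin)

demo.r
demo.test$conf.int
```

With a sample of around 30 users or fewer, the confidence interval is likely to be wide, so the coefficient is best read alongside the interval rather than on its own.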


## Summary of Project Analysis

For this case study, I analyzed Fitbit data related to **activity** and **sleep** to examine trends among fitness tracking users, to then inform insights for improving the Bellabeat Time fitness watch, which is designed to track user activity and sleep patterns, among other measures.

### Activity
I analyzed *daily activity* data related to activity intensities (very, moderate, and lightly active), calories, and steps. By determining whether users reached healthy weekly amounts of light/moderate or very active time, and how often they met those goals (in a majority of weeks, meaning at least 2 of 3), I found two main groups of exercisers:

* Light/moderate exercisers (but not very active) -- LM
* Light/moderate and very active exercisers -- LMV
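
The grouping rule described above can be sketched as follows. This is an illustration only, not the code used earlier in the analysis: the data frame `demo.weekly`, its column names, and its values are hypothetical, and the weekly targets (150 light/moderate minutes, 75 very active minutes) are assumed here from the CDC adult guidelines.

```{r sketch-exerciser-groups}
library(dplyr)

# Hypothetical weekly totals per user (made-up values, three weeks each)
demo.weekly <- tibble(
  Id          = rep(c("A", "B", "C"), each = 3),
  week        = rep(1:3, times = 3),
  lmActiveMin = c(200, 180, 90,  160, 170, 155,  40, 60, 30),
  vActiveMin  = c(10, 20, 0,     80, 90, 30,     0, 5, 0)
)

# Count the weeks in which each user met each (assumed) weekly target,
# then label users who met a target in at least 2 of 3 weeks
demo.groups <- demo.weekly %>%
  group_by(Id) %>%
  summarise(
    lmWeeksMet = sum(lmActiveMin >= 150),
    vWeeksMet  = sum(vActiveMin >= 75),
    .groups = "drop"
  ) %>%
  mutate(
    user_type = case_when(
      lmWeeksMet >= 2 & vWeeksMet >= 2 ~ "LMV",
      lmWeeksMet >= 2                  ~ "LM",
      TRUE                             ~ "not active"
    )
  )

demo.groups
```

In this toy data, user A meets only the light/moderate target in two weeks, user B meets both targets in at least two weeks, and user C meets neither.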

Then, I looked at *hourly activity* data, using this same grouping of exerciser types. I examined when people most frequently exercised, based on average total intensities by hour and day of the week and determined separate trends for different types of exercisers, along with trends for the overall users.
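
Averaging intensity by weekday and hour could look roughly like the chunk below. It is a sketch under assumptions: `demo.hourly`, its column names, and its values are hypothetical stand-ins for the hourly intensity data.

```{r sketch-hourly-intensity}
library(dplyr)

# Hypothetical hourly records (made-up values)
demo.hourly <- tibble(
  activityHour   = as.POSIXct(c("2016-04-19 12:00:00", "2016-04-19 18:00:00",
                                "2016-04-23 13:00:00", "2016-04-26 12:00:00"),
                              tz = "UTC"),
  totalIntensity = c(40, 25, 55, 60)
)

# Derive weekday and hour, then average intensity for each weekday-hour slot
# (weekday names returned by weekdays() depend on the session locale)
demo.byHour <- demo.hourly %>%
  mutate(
    weekday = weekdays(activityHour),
    hour    = as.integer(format(activityHour, "%H"))
  ) %>%
  group_by(weekday, hour) %>%
  summarise(avgIntensity = mean(totalIntensity), .groups = "drop") %>%
  arrange(desc(avgIntensity))

demo.byHour
```

A table in this shape (one row per weekday-hour slot) can be passed directly to `geom_tile()` for a heatmap.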

*Activity Trends*

* Wider distribution of amount of activity on weekends than weekdays
* Of the weekdays, slightly greater median of activity intensity on Tuesdays
* More weekly calories burned in less time and over a shorter distance by users who are frequently very active, compared with users who are frequently light/moderately active only
* Overall, more intense workouts occur on Tuesdays at noon, Saturdays at 1 p.m., and after 5 p.m. on weekdays
* Frequent light/moderate exercisers tend to be more active on weekends, and weekdays after 6 p.m.
* Frequent very active exercisers tend to be more active on Monday, Tuesday, and Saturday around noon; and Sunday through Wednesday after 5 p.m.

### Sleep

For the *sleep day* data, I analyzed the following and found these *trends*:

* How often were users recording sleep data?
  + Fewer than 20% of users consistently recorded sleep data over three weeks

* How often were users getting a healthy amount of sleep each day (at least 7 hours), of the total days they recorded sleep data?
  + Fewer than a quarter of users had healthy sleep on at least 75% of their recorded days
  + Over a quarter of users had healthy sleep on fewer than 25% of their recorded days

* Merging the sleep and activity data and plotting total weekly active minutes against average sleep minutes per week shows a slight positive correlation between sleep and activity for exercisers of either intensity group (LMV or LM).
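
The per-user share of healthy sleep days in the list above can be computed along these lines. This is a minimal sketch with hypothetical data: `demo.sleep`, its column names, and its values are made up, and the 420-minute cutoff corresponds to the 7-hour threshold used throughout.

```{r sketch-healthy-sleep-share}
library(dplyr)

# Hypothetical sleep log: one row per user per recorded night (made-up values)
demo.sleep <- tibble(
  Id             = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
  totalMinAsleep = c(450, 430, 400, 480, 300, 350, 425, 380, 410, 390)
)

# Share of recorded days with at least 7 hours (420 minutes) of sleep, per user
demo.sleepShare <- demo.sleep %>%
  group_by(Id) %>%
  summarise(
    daysRecorded = n(),
    healthyDays  = sum(totalMinAsleep >= 420),
    pctHealthy   = healthyDays / daysRecorded,
    .groups = "drop"
  )

demo.sleepShare
```

Because the denominator is the number of days each user actually recorded, a user with very few recorded nights can end up with an extreme percentage, so `daysRecorded` is worth keeping next to `pctHealthy`.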

### Data Limitations

The data used for this analysis came from a small sample (around 30 users or fewer).

In particular, the sleep dataset had information for fewer than 30 users. Because of the small sample and the missing data, the sleep trends found in this analysis cannot be treated as representative of a larger population.

As the data collected was for a short time period, between April and May 2016, the health activity and sleep patterns are limited to an optimal season of exercise (spring/fall, depending on the hemisphere location of each user). Activity and sleep trends could vary during other seasons of the year.

Because user Ids were anonymized, nothing is known about users' biological sex, age, work hours, living conditions, or class status, all of which could affect a person's exercise and sleep schedules.


### Act -- High-level Recommendations for Stakeholders

The following recommendations are based on trends from the analysis and can be applied to Bellabeat's fitness tracking watch, Time.

**Activity tracking suggestions**

* Provide a questionnaire for users to indicate the workout intensity they plan on doing regularly, with an option to set weekly exercise goals
* Provide a weekly summary graph to show times of activity intensity per day and weekly totals per type of intensity (light/moderate and very active)
* Include a reminder of the weekly exercise goal for each intensity level and the time still needed to reach it, based on the active time already logged that week
* Give users the option to create hourly exercise schedules with reminder notifications
* Suggest optimal exercise hours based on the type of exercise users intend to do, as answered in their questionnaire

**Sleep tracking suggestions**

* Create an optional sleep-tracking ring to act as a companion to the Time watch. The ring is a smaller wearable option that could encourage users to wear a tracking device every night while sleeping.
* Provide the option of daily notifications to remind users to wear the watch while sleeping
* Allow users to select optimal sleep and waking times and set reminders to go to bed by that certain time
* Inform users of the benefits of a healthy amount of sleep, which is correlated with more weekly exercise


## References

The following references are organized into sections by order of appearance.

#### Data, Background Information

1. Furberg, R., Brinton, J., Keating, M., & Ortiz, A. (2016). *Crowd-sourced Fitbit datasets 03.12.2016-05.12.2016*. [Data set]. Zenodo. https://doi.org/10.5281/zenodo.53894. Retrieved from https://www.kaggle.com/arashnic/fitbit.
2. Fitabase. (2018). *Fitabase Data Dictionary*. Retrieved from https://www.fitabase.com/media/1930/fitabasedatadictionary102320.pdf
3. National Health Interview Survey (NHIS), Centers for Disease Control and Prevention (CDC) / National Center for Health Statistics (NCHS). (2022). *Physical Activity.* Retrieved from https://www.healthypeople.gov/2020/topics-objectives/topic/physical-activity/national-snapshot.
4. CDC. (2022). *Physical Activity*. Retrieved from https://www.cdc.gov/physicalactivity/basics/adults/index.htm.
5. Watson NF, Badr MS, Belenky G, et al. Recommended amount of sleep for a healthy adult: a joint consensus statement of the American Academy of Sleep Medicine and Sleep Research Society. Sleep. 2015;38(6):843–844. http://dx.doi.org/10.5665/sleep.4716.


#### Coding Resources

1. Soni, Yash. (2018). *How I analyzed the data from my Fitbit to improve my overall health*. freeCodeCamp. https://www.freecodecamp.org/news/how-i-analyzed-the-data-from-my-fitbit-to-improve-my-overall-health-a2e36426d8f9/.
2. Kassambara, Alboukadel. *ggcorrplot: Visualization of a correlation matrix using ggplot2* (0.1.3). https://rpkgs.datanovia.com/ggcorrplot/.
3. Holtz, Yan. (2018). *Violin plot with included boxplot and sample size in ggplot2*. The R Graph Gallery. https://www.r-graph-gallery.com/violin_and_boxplot_ggplot2.html.
4. Kamvar, Zhian N. (2021). *Package aweek: Convert Dates to Arbitrary Week Definitions* (1.02). Retrieved from https://cran.r-project.org/web/packages/aweek/aweek.pdf.
5. Li, Deanna. (2020). *Basic R Guide for NSC Statistics, Chapter 19: Scatterplots and Best Fit Lines - Two Sets*. (Bookdown). https://bookdown.org/dli/rguide/scatterplots-and-best-fit-lines-two-sets.html.
6. Holtz, Yan. (2018). *ggplot2 heatmap*. The R Graph Gallery. https://www.r-graph-gallery.com/79-levelplot-with-ggplot2.html.
7. Holtz, Yan. (2018). *R Color Brewer's Palettes*. The R Graph Gallery. https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html.
8. Neitmann, Thomas. (2020). *Transform a ggplot2 Axis to a Percentage Scale*. https://thomasadventure.blog/posts/ggplot2-percentage-scale/.
9. Wickham, Hadley and Seidel, Dana. (2020). *Package scales: Scale Functions for Visualization* (1.1.1). https://cran.r-project.org/web/packages/scales/scales.pdf.
10. roelpeters.be. (2019). *Add percentages to your axes in R's ggplot2 (and set the limits)*. https://www.roelpeters.be/add-percentages-to-your-axes-in-rs-ggplot2-and-set-the-limits/.
11. rcs. (2012). *How to put labels over geom_bar for each bar in R with ggplot2*. Stack Overflow. https://stackoverflow.com/questions/12018499/how-to-put-labels-over-geom-bar-for-each-bar-in-r-with-ggplot2.

